Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-19 Thread Michael Ellerman
Tejun Heo  writes:

> Hello, Michael.
>
> On Tue, Oct 18, 2016 at 03:37:42PM +1100, Michael Ellerman wrote:
>> That doesn't compile, wq doesn't exist.
>> 
>> I guessed that you meant:
>> 
>> +   wq_numa_init();
>> +   list_for_each_entry(wq, , list)
>> +   wq_update_unbound_numa(wq, smp_processor_id(), true);
>
> Yeap, sorry about that.

No worries.

>> And that does boot.
>> 
>> The sysrq-t output is below, it's rather large.
>
> Didn't expect that many cpus but it looks good.

I have another system here with twice as many CPUs if that would help ;)

> I'll post proper patches soon.

Thanks. Can you pull the current series out of linux-next for now until
you merge the new series?

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-19 Thread Michael Ellerman
Tejun Heo  writes:

> Hello, Michael.
>
> On Tue, Oct 18, 2016 at 03:37:42PM +1100, Michael Ellerman wrote:
>> That doesn't compile, wq doesn't exist.
>> 
>> I guessed that you meant:
>> 
>> +   wq_numa_init();
>> +   list_for_each_entry(wq, , list)
>> +   wq_update_unbound_numa(wq, smp_processor_id(), true);
>
> Yeap, sorry about that.

No worries.

>> And that does boot.
>> 
>> The sysrq-t output is below, it's rather large.
>
> Didn't expect that many cpus but it looks good.

I have another system here with twice as many CPUs if that would help ;)

> I'll post proper patches soon.

Thanks. Can you pull the current series out of linux-next for now until
you merge the new series?

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-18 Thread Tejun Heo
Hello, Michael.

On Tue, Oct 18, 2016 at 03:37:42PM +1100, Michael Ellerman wrote:
> That doesn't compile, wq doesn't exist.
> 
> I guessed that you meant:
> 
> +   wq_numa_init();
> +   list_for_each_entry(wq, , list)
> +   wq_update_unbound_numa(wq, smp_processor_id(), true);

Yeap, sorry about that.

> And that does boot.
> 
> The sysrq-t output is below, it's rather large.

Didn't expect that many cpus but it looks good.  I'll post proper
patches soon.

Thanks!

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-18 Thread Tejun Heo
Hello, Michael.

On Tue, Oct 18, 2016 at 03:37:42PM +1100, Michael Ellerman wrote:
> That doesn't compile, wq doesn't exist.
> 
> I guessed that you meant:
> 
> +   wq_numa_init();
> +   list_for_each_entry(wq, , list)
> +   wq_update_unbound_numa(wq, smp_processor_id(), true);

Yeap, sorry about that.

> And that does boot.
> 
> The sysrq-t output is below, it's rather large.

Didn't expect that many cpus but it looks good.  I'll post proper
patches soon.

Thanks!

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Michael Ellerman
Balbir Singh  writes:
> On 17/10/16 23:24, Michael Ellerman wrote:
>> That happened because we haven't yet called set_cpu_numa_node() for the 
>> non-boot
>> cpus, because that happens in smp_prepare_cpus(), and
>> workqueue_init_early() is called much earlier than that.
>> 
>> This doesn't trigger on x86 because it does set_cpu_numa_node() in
>> setup_per_cpu_areas(), which is called prior to workqueue_init_early().
>> 
>> We can (should) probably do the same on powerpc, I'll look at that
>> tomorrow. But other arches may have a similar problem, and at the very
>> least we need to document that workqueue_init_early() relies on
>> cpu_to_node() working.
>
> Don't we do the setup cpu->node mapings in initmem_init()?
> Ideally we have setup_arch->intmem_init->numa_setup_cpu

That sets up numa_cpu_lookup_table, which is a powerpc only data
structure.

But it doesn't setup the percpu numa_node variables, used by
cpu_to_node(), because percpu areas are not setup yet.

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Michael Ellerman
Balbir Singh  writes:
> On 17/10/16 23:24, Michael Ellerman wrote:
>> That happened because we haven't yet called set_cpu_numa_node() for the 
>> non-boot
>> cpus, because that happens in smp_prepare_cpus(), and
>> workqueue_init_early() is called much earlier than that.
>> 
>> This doesn't trigger on x86 because it does set_cpu_numa_node() in
>> setup_per_cpu_areas(), which is called prior to workqueue_init_early().
>> 
>> We can (should) probably do the same on powerpc, I'll look at that
>> tomorrow. But other arches may have a similar problem, and at the very
>> least we need to document that workqueue_init_early() relies on
>> cpu_to_node() working.
>
> Don't we do the setup cpu->node mapings in initmem_init()?
> Ideally we have setup_arch->intmem_init->numa_setup_cpu

That sets up numa_cpu_lookup_table, which is a powerpc only data
structure.

But it doesn't setup the percpu numa_node variables, used by
cpu_to_node(), because percpu areas are not setup yet.

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Tejun Heo
Hello, Michael.

Other NUMA archs are lazy-initializing cpu to node mapping too, so we
need to fix it from workqueue side.  This also means that we've been
getting NUMA node wrong for percpu pools on those archs.

Can you please try the following patch and if it resolves the issue,
report the workqueue part (it's at the end) of sysrq-t dump?

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 984f6ff..276557b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4411,14 +4411,14 @@ void show_workqueue_state(void)
break;
}
}
-   if (idle)
-   continue;
+   //if (idle)
+   //  continue;
 
pr_info("workqueue %s: flags=0x%x\n", wq->name, wq->flags);
 
for_each_pwq(pwq, wq) {
spin_lock_irqsave(>pool->lock, flags);
-   if (pwq->nr_active || !list_empty(>delayed_works))
+   //if (pwq->nr_active || 
!list_empty(>delayed_works))
show_pwq(pwq);
spin_unlock_irqrestore(>pool->lock, flags);
}
@@ -4429,8 +4429,8 @@ void show_workqueue_state(void)
bool first = true;
 
spin_lock_irqsave(>lock, flags);
-   if (pool->nr_workers == pool->nr_idle)
-   goto next_pool;
+   //if (pool->nr_workers == pool->nr_idle)
+   //  goto next_pool;
 
pr_info("pool %d:", pool->id);
pr_cont_pool_info(pool);
@@ -4649,10 +4649,12 @@ int workqueue_online_cpu(unsigned int cpu)
for_each_pool(pool, pi) {
mutex_lock(>attach_mutex);
 
-   if (pool->cpu == cpu)
+   if (pool->cpu == cpu) {
+   pool->node = cpu_to_node(cpu);
rebind_workers(pool);
-   else if (pool->cpu < 0)
+   } else if (pool->cpu < 0) {
restore_unbound_workers_cpumask(pool, cpu);
+   }
 
mutex_unlock(>attach_mutex);
}
@@ -5495,8 +5497,6 @@ int __init workqueue_init_early(void)
 
pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
-   wq_numa_init();
-
/* initialize CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;
@@ -5571,6 +5571,9 @@ int __init workqueue_init(void)
struct worker_pool *pool;
int cpu, bkt;
 
+   wq_numa_init();
+   wq_update_unbound_numa(wq, smp_processor_id(), true);
+
/* create the initial workers */
for_each_online_cpu(cpu) {
for_each_cpu_worker_pool(pool, cpu) {


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Tejun Heo
Hello, Michael.

Other NUMA archs are lazy-initializing cpu to node mapping too, so we
need to fix it from workqueue side.  This also means that we've been
getting NUMA node wrong for percpu pools on those archs.

Can you please try the following patch and if it resolves the issue,
report the workqueue part (it's at the end) of sysrq-t dump?

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 984f6ff..276557b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4411,14 +4411,14 @@ void show_workqueue_state(void)
break;
}
}
-   if (idle)
-   continue;
+   //if (idle)
+   //  continue;
 
pr_info("workqueue %s: flags=0x%x\n", wq->name, wq->flags);
 
for_each_pwq(pwq, wq) {
spin_lock_irqsave(>pool->lock, flags);
-   if (pwq->nr_active || !list_empty(>delayed_works))
+   //if (pwq->nr_active || 
!list_empty(>delayed_works))
show_pwq(pwq);
spin_unlock_irqrestore(>pool->lock, flags);
}
@@ -4429,8 +4429,8 @@ void show_workqueue_state(void)
bool first = true;
 
spin_lock_irqsave(>lock, flags);
-   if (pool->nr_workers == pool->nr_idle)
-   goto next_pool;
+   //if (pool->nr_workers == pool->nr_idle)
+   //  goto next_pool;
 
pr_info("pool %d:", pool->id);
pr_cont_pool_info(pool);
@@ -4649,10 +4649,12 @@ int workqueue_online_cpu(unsigned int cpu)
for_each_pool(pool, pi) {
mutex_lock(>attach_mutex);
 
-   if (pool->cpu == cpu)
+   if (pool->cpu == cpu) {
+   pool->node = cpu_to_node(cpu);
rebind_workers(pool);
-   else if (pool->cpu < 0)
+   } else if (pool->cpu < 0) {
restore_unbound_workers_cpumask(pool, cpu);
+   }
 
mutex_unlock(>attach_mutex);
}
@@ -5495,8 +5497,6 @@ int __init workqueue_init_early(void)
 
pwq_cache = KMEM_CACHE(pool_workqueue, SLAB_PANIC);
 
-   wq_numa_init();
-
/* initialize CPU pools */
for_each_possible_cpu(cpu) {
struct worker_pool *pool;
@@ -5571,6 +5571,9 @@ int __init workqueue_init(void)
struct worker_pool *pool;
int cpu, bkt;
 
+   wq_numa_init();
+   wq_update_unbound_numa(wq, smp_processor_id(), true);
+
/* create the initial workers */
for_each_online_cpu(cpu) {
for_each_cpu_worker_pool(pool, cpu) {


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Tejun Heo
Hello, Michael.

On Mon, Oct 17, 2016 at 11:24:34PM +1100, Michael Ellerman wrote:
> The bad case (where we hit the BUG_ON I added above) is where we are
> creating a wq for node 1.
> 
> In wq_calc_node_cpumask() we do:
> 
>   cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
>   return !cpumask_equal(cpumask, attrs->cpumask);
> 
> Which with the arguments inserted is:
> 
>   cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, 
> wq_numa_possible_cpumask[1]);
>   return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask);
> 
> And that results in tmp_attrs->cpumask being empty, because
> wq_numa_possible_cpumask[1] is an empty cpumask.

Ah, should have read this before replying to the previous mail, so
it's the numa mask, not the cpu_possible_mask.

> The reason wq_numa_possible_cpumask[1] is an empty mask is because in
> wq_numa_init() we did:
> 
>   for_each_possible_cpu(cpu) {
>   node = cpu_to_node(cpu);
>   if (WARN_ON(node == NUMA_NO_NODE)) {
>   pr_warn("workqueue: NUMA node mapping not available for 
> cpu%d, disabling NUMA support\n", cpu);
>   /* happens iff arch is bonkers, let's just proceed */
>   return;
>   }
>   cpumask_set_cpu(cpu, tbl[node]);
>   }
> 
> And cpu_to_node() returned node 0 for every CPU in the system, despite there
> being multiple nodes.
> 
> That happened because we haven't yet called set_cpu_numa_node() for the 
> non-boot
> cpus, because that happens in smp_prepare_cpus(), and
> workqueue_init_early() is called much earlier than that.
> 
> This doesn't trigger on x86 because it does set_cpu_numa_node() in
> setup_per_cpu_areas(), which is called prior to workqueue_init_early().
> 
> We can (should) probably do the same on powerpc, I'll look at that
> tomorrow. But other arches may have a similar problem, and at the very
> least we need to document that workqueue_init_early() relies on
> cpu_to_node() working.

I should be able to move the numa part of initialization to the later
init function.  Working on it.

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Tejun Heo
Hello, Michael.

On Mon, Oct 17, 2016 at 11:24:34PM +1100, Michael Ellerman wrote:
> The bad case (where we hit the BUG_ON I added above) is where we are
> creating a wq for node 1.
> 
> In wq_calc_node_cpumask() we do:
> 
>   cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
>   return !cpumask_equal(cpumask, attrs->cpumask);
> 
> Which with the arguments inserted is:
> 
>   cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, 
> wq_numa_possible_cpumask[1]);
>   return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask);
> 
> And that results in tmp_attrs->cpumask being empty, because
> wq_numa_possible_cpumask[1] is an empty cpumask.

Ah, should have read this before replying to the previous mail, so
it's the numa mask, not the cpu_possible_mask.

> The reason wq_numa_possible_cpumask[1] is an empty mask is because in
> wq_numa_init() we did:
> 
>   for_each_possible_cpu(cpu) {
>   node = cpu_to_node(cpu);
>   if (WARN_ON(node == NUMA_NO_NODE)) {
>   pr_warn("workqueue: NUMA node mapping not available for 
> cpu%d, disabling NUMA support\n", cpu);
>   /* happens iff arch is bonkers, let's just proceed */
>   return;
>   }
>   cpumask_set_cpu(cpu, tbl[node]);
>   }
> 
> And cpu_to_node() returned node 0 for every CPU in the system, despite there
> being multiple nodes.
> 
> That happened because we haven't yet called set_cpu_numa_node() for the 
> non-boot
> cpus, because that happens in smp_prepare_cpus(), and
> workqueue_init_early() is called much earlier than that.
> 
> This doesn't trigger on x86 because it does set_cpu_numa_node() in
> setup_per_cpu_areas(), which is called prior to workqueue_init_early().
> 
> We can (should) probably do the same on powerpc, I'll look at that
> tomorrow. But other arches may have a similar problem, and at the very
> least we need to document that workqueue_init_early() relies on
> cpu_to_node() working.

I should be able to move the numa part of initialization to the later
init function.  Working on it.

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Tejun Heo
Hello,

On Sat, Oct 15, 2016 at 08:48:01PM +1100, Michael Ellerman wrote:
> > Hmm... it doesn't reproduce it here and can't see how the commit would
> > affect this given that it doesn't really change when the kworker
> > kthreads are being created.
> 
> Try turning on CONFIG_DEBUG_PER_CPU_MAPS=y ?
> 
> That will warn if you're indexing off the end of a cpu mask and just
> getting lucky with the result.

That's not happening on x86.  That could mean that powerpc is
initializing cpu_possible_mask after workqueue_init_early().  Looking
into it.

> [ cut here ]
> WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:1602 
> try_to_wake_up+0x3f4/0x5c0
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 4.8.0-compiler_gcc-6.2.0-next-20161014-dirty #116
> task: c00ff920 task.stack: c01ffc084000
> NIP: c00f1ba4 LR: c00f180c CTR: 
> REGS: c01ffc0878f0 TRAP: 0700   Not tainted  
> (4.8.0-compiler_gcc-6.2.0-next-20161014-dirty)
> MSR: 92029033  CR: 28000422  XER: 
> 
> CFAR: c00f18bc SOFTE: 0 
> GPR00: c00f180c c01ffc087b70 c0e83400  
> GPR04: 0002    
> GPR08: c0dc3400 0001 0002  
> GPR12:  cfb8 c000e0c8  
> GPR16:     
> GPR20:    c0eb8960 
> GPR24:  c0d8ce00   
> GPR28: c007f54050f4   c007f5404900 
> NIP [c00f1ba4] try_to_wake_up+0x3f4/0x5c0
> LR [c00f180c] try_to_wake_up+0x5c/0x5c0
> Call Trace:
> [c01ffc087b70] [c00f180c] try_to_wake_up+0x5c/0x5c0 (unreliable)
> [c01ffc087bf0] [c00d53e4] create_worker+0x144/0x250
> [c01ffc087c90] [c0cf7930] workqueue_init+0x170/0x19c
> [c01ffc087d00] [c0ce0e74] kernel_init_freeable+0x158/0x360
> [c01ffc087dc0] [c000e0e4] kernel_init+0x24/0x160
> [c01ffc087e30] [c000bfa0] ret_from_kernel_thread+0x5c/0xbc
> Instruction dump:
> e8790890 4bff6ed9 2fa3 419e00dc 6000 4bfffe54 3d02fff4 8928d7f9 
> 2f89 409e0018 3921 9928d7f9 <0fe0> 6000 6042 3b5f0368 
> ---[ end trace  ]---
> 
> But I'm not sure that tells us anything new?

Yeah, I should have asked to print out information of the target task
but it looks like we have enough information now.

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Tejun Heo
Hello,

On Sat, Oct 15, 2016 at 08:48:01PM +1100, Michael Ellerman wrote:
> > Hmm... it doesn't reproduce it here and can't see how the commit would
> > affect this given that it doesn't really change when the kworker
> > kthreads are being created.
> 
> Try turning on CONFIG_DEBUG_PER_CPU_MAPS=y ?
> 
> That will warn if you're indexing off the end of a cpu mask and just
> getting lucky with the result.

That's not happening on x86.  That could mean that powerpc is
initializing cpu_possible_mask after workqueue_init_early().  Looking
into it.

> [ cut here ]
> WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:1602 
> try_to_wake_up+0x3f4/0x5c0
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 4.8.0-compiler_gcc-6.2.0-next-20161014-dirty #116
> task: c00ff920 task.stack: c01ffc084000
> NIP: c00f1ba4 LR: c00f180c CTR: 
> REGS: c01ffc0878f0 TRAP: 0700   Not tainted  
> (4.8.0-compiler_gcc-6.2.0-next-20161014-dirty)
> MSR: 92029033  CR: 28000422  XER: 
> 
> CFAR: c00f18bc SOFTE: 0 
> GPR00: c00f180c c01ffc087b70 c0e83400  
> GPR04: 0002    
> GPR08: c0dc3400 0001 0002  
> GPR12:  cfb8 c000e0c8  
> GPR16:     
> GPR20:    c0eb8960 
> GPR24:  c0d8ce00   
> GPR28: c007f54050f4   c007f5404900 
> NIP [c00f1ba4] try_to_wake_up+0x3f4/0x5c0
> LR [c00f180c] try_to_wake_up+0x5c/0x5c0
> Call Trace:
> [c01ffc087b70] [c00f180c] try_to_wake_up+0x5c/0x5c0 (unreliable)
> [c01ffc087bf0] [c00d53e4] create_worker+0x144/0x250
> [c01ffc087c90] [c0cf7930] workqueue_init+0x170/0x19c
> [c01ffc087d00] [c0ce0e74] kernel_init_freeable+0x158/0x360
> [c01ffc087dc0] [c000e0e4] kernel_init+0x24/0x160
> [c01ffc087e30] [c000bfa0] ret_from_kernel_thread+0x5c/0xbc
> Instruction dump:
> e8790890 4bff6ed9 2fa3 419e00dc 6000 4bfffe54 3d02fff4 8928d7f9 
> 2f89 409e0018 3921 9928d7f9 <0fe0> 6000 6042 3b5f0368 
> ---[ end trace  ]---
> 
> But I'm not sure that tells us anything new?

Yeah, I should have asked to print out information of the target task
but it looks like we have enough information now.

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Balbir Singh


On 17/10/16 23:24, Michael Ellerman wrote:
> Tejun Heo  writes:
> 
>> Hello, Michael.
>>
>> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>>> is NULL.
>>>
>>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>>> set_task_rq() and happen to get NULL.
>>>
>>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>>> doesn't cope with that.
>>
>> Hmm... it doesn't reproduce it here and can't see how the commit would
>> affect this given that it doesn't really change when the kworker
>> kthreads are being created.
> 
> It changes when the pool attributes are created, which is the source of
> the bug.
> 
> The original crash happens because we have a task with an empty cpus_allowed
> mask. That mask originally comes from pool->attrs->cpumask.
> 
> The attrs for the pool are created early via workqueue_init_early() in
> apply_wqattrs_prepare():
> 
>   start_here_common
>   -> start_kernel
>  -> workqueue_init_early
> -> __alloc_workqueue_key
>-> apply_workqueue_attrs
>   -> apply_workqueue_attrs_locked
>  -> apply_wqattrs_prepare
> 
> In there we do:
> 
>   copy_workqueue_attrs(new_attrs, attrs);
>   cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
>   if (unlikely(cpumask_empty(new_attrs->cpumask)))
>   cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);
>   ...
>   copy_workqueue_attrs(tmp_attrs, new_attrs);
>   ...
>   for_each_node(node) {
>   if (wq_calc_node_cpumask(new_attrs, node, -1, 
> tmp_attrs->cpumask)) {
> + BUG_ON(cpumask_empty(tmp_attrs->cpumask));
>   ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
> 
> 
> The bad case (where we hit the BUG_ON I added above) is where we are
> creating a wq for node 1.
> 
> In wq_calc_node_cpumask() we do:
> 
>   cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
>   return !cpumask_equal(cpumask, attrs->cpumask);
> 
> Which with the arguments inserted is:
> 
>   cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, 
> wq_numa_possible_cpumask[1]);
>   return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask);
> 
> And that results in tmp_attrs->cpumask being empty, because
> wq_numa_possible_cpumask[1] is an empty cpumask.
> 
> The reason wq_numa_possible_cpumask[1] is an empty mask is because in
> wq_numa_init() we did:
> 
>   for_each_possible_cpu(cpu) {
>   node = cpu_to_node(cpu);
>   if (WARN_ON(node == NUMA_NO_NODE)) {
>   pr_warn("workqueue: NUMA node mapping not available for 
> cpu%d, disabling NUMA support\n", cpu);
>   /* happens iff arch is bonkers, let's just proceed */
>   return;
>   }
>   cpumask_set_cpu(cpu, tbl[node]);
>   }
> 
> And cpu_to_node() returned node 0 for every CPU in the system, despite there
> being multiple nodes.
> 
> That happened because we haven't yet called set_cpu_numa_node() for the 
> non-boot
> cpus, because that happens in smp_prepare_cpus(), and
> workqueue_init_early() is called much earlier than that.
> 
> This doesn't trigger on x86 because it does set_cpu_numa_node() in
> setup_per_cpu_areas(), which is called prior to workqueue_init_early().
> 
> We can (should) probably do the same on powerpc, I'll look at that
> tomorrow. But other arches may have a similar problem, and at the very
> least we need to document that workqueue_init_early() relies on
> cpu_to_node() working.

Don't we do the setup cpu->node mapings in initmem_init()?
Ideally we have setup_arch->intmem_init->numa_setup_cpu

Will look at it tomorrow
Balbir Singh






Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Balbir Singh


On 17/10/16 23:24, Michael Ellerman wrote:
> Tejun Heo  writes:
> 
>> Hello, Michael.
>>
>> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>>> is NULL.
>>>
>>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>>> set_task_rq() and happen to get NULL.
>>>
>>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>>> doesn't cope with that.
>>
>> Hmm... it doesn't reproduce it here and can't see how the commit would
>> affect this given that it doesn't really change when the kworker
>> kthreads are being created.
> 
> It changes when the pool attributes are created, which is the source of
> the bug.
> 
> The original crash happens because we have a task with an empty cpus_allowed
> mask. That mask originally comes from pool->attrs->cpumask.
> 
> The attrs for the pool are created early via workqueue_init_early() in
> apply_wqattrs_prepare():
> 
>   start_here_common
>   -> start_kernel
>  -> workqueue_init_early
> -> __alloc_workqueue_key
>-> apply_workqueue_attrs
>   -> apply_workqueue_attrs_locked
>  -> apply_wqattrs_prepare
> 
> In there we do:
> 
>   copy_workqueue_attrs(new_attrs, attrs);
>   cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
>   if (unlikely(cpumask_empty(new_attrs->cpumask)))
>   cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);
>   ...
>   copy_workqueue_attrs(tmp_attrs, new_attrs);
>   ...
>   for_each_node(node) {
>   if (wq_calc_node_cpumask(new_attrs, node, -1, 
> tmp_attrs->cpumask)) {
> + BUG_ON(cpumask_empty(tmp_attrs->cpumask));
>   ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);
> 
> 
> The bad case (where we hit the BUG_ON I added above) is where we are
> creating a wq for node 1.
> 
> In wq_calc_node_cpumask() we do:
> 
>   cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
>   return !cpumask_equal(cpumask, attrs->cpumask);
> 
> Which with the arguments inserted is:
> 
>   cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, 
> wq_numa_possible_cpumask[1]);
>   return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask);
> 
> And that results in tmp_attrs->cpumask being empty, because
> wq_numa_possible_cpumask[1] is an empty cpumask.
> 
> The reason wq_numa_possible_cpumask[1] is an empty mask is because in
> wq_numa_init() we did:
> 
>   for_each_possible_cpu(cpu) {
>   node = cpu_to_node(cpu);
>   if (WARN_ON(node == NUMA_NO_NODE)) {
>   pr_warn("workqueue: NUMA node mapping not available for 
> cpu%d, disabling NUMA support\n", cpu);
>   /* happens iff arch is bonkers, let's just proceed */
>   return;
>   }
>   cpumask_set_cpu(cpu, tbl[node]);
>   }
> 
> And cpu_to_node() returned node 0 for every CPU in the system, despite there
> being multiple nodes.
> 
> That happened because we haven't yet called set_cpu_numa_node() for the 
> non-boot
> cpus, because that happens in smp_prepare_cpus(), and
> workqueue_init_early() is called much earlier than that.
> 
> This doesn't trigger on x86 because it does set_cpu_numa_node() in
> setup_per_cpu_areas(), which is called prior to workqueue_init_early().
> 
> We can (should) probably do the same on powerpc, I'll look at that
> tomorrow. But other arches may have a similar problem, and at the very
> least we need to document that workqueue_init_early() relies on
> cpu_to_node() working.

Don't we do the setup cpu->node mapings in initmem_init()?
Ideally we have setup_arch->intmem_init->numa_setup_cpu

Will look at it tomorrow
Balbir Singh






Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Michael Ellerman
Tejun Heo  writes:

> Hello, Michael.
>
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>> 
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>> 
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
>
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.

It changes when the pool attributes are created, which is the source of
the bug.

The original crash happens because we have a task with an empty cpus_allowed
mask. That mask originally comes from pool->attrs->cpumask.

The attrs for the pool are created early via workqueue_init_early() in
apply_wqattrs_prepare():

  start_here_common
  -> start_kernel
 -> workqueue_init_early
-> __alloc_workqueue_key
   -> apply_workqueue_attrs
  -> apply_workqueue_attrs_locked
 -> apply_wqattrs_prepare
  
In there we do:

copy_workqueue_attrs(new_attrs, attrs);
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
if (unlikely(cpumask_empty(new_attrs->cpumask)))
cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);
...
copy_workqueue_attrs(tmp_attrs, new_attrs);
...
for_each_node(node) {
if (wq_calc_node_cpumask(new_attrs, node, -1, 
tmp_attrs->cpumask)) {
+   BUG_ON(cpumask_empty(tmp_attrs->cpumask));
ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);


The bad case (where we hit the BUG_ON I added above) is where we are
creating a wq for node 1.

In wq_calc_node_cpumask() we do:

cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
return !cpumask_equal(cpumask, attrs->cpumask);

Which with the arguments inserted is:

cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, 
wq_numa_possible_cpumask[1]);
return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask);

And that results in tmp_attrs->cpumask being empty, because
wq_numa_possible_cpumask[1] is an empty cpumask.

The reason wq_numa_possible_cpumask[1] is an empty mask is because in
wq_numa_init() we did:

for_each_possible_cpu(cpu) {
node = cpu_to_node(cpu);
if (WARN_ON(node == NUMA_NO_NODE)) {
pr_warn("workqueue: NUMA node mapping not available for 
cpu%d, disabling NUMA support\n", cpu);
/* happens iff arch is bonkers, let's just proceed */
return;
}
cpumask_set_cpu(cpu, tbl[node]);
}

And cpu_to_node() returned node 0 for every CPU in the system, despite there
being multiple nodes.

That happened because we haven't yet called set_cpu_numa_node() for the non-boot
cpus, because that happens in smp_prepare_cpus(), and
workqueue_init_early() is called much earlier than that.

This doesn't trigger on x86 because it does set_cpu_numa_node() in
setup_per_cpu_areas(), which is called prior to workqueue_init_early().

We can (should) probably do the same on powerpc, I'll look at that
tomorrow. But other arches may have a similar problem, and at the very
least we need to document that workqueue_init_early() relies on
cpu_to_node() working.

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-17 Thread Michael Ellerman
Tejun Heo  writes:

> Hello, Michael.
>
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>> 
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>> 
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
>
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.

It changes when the pool attributes are created, which is the source of
the bug.

The original crash happens because we have a task with an empty cpus_allowed
mask. That mask originally comes from pool->attrs->cpumask.

The attrs for the pool are created early via workqueue_init_early() in
apply_wqattrs_prepare():

  start_here_common
  -> start_kernel
 -> workqueue_init_early
-> __alloc_workqueue_key
   -> apply_workqueue_attrs
  -> apply_workqueue_attrs_locked
 -> apply_wqattrs_prepare
  
In there we do:

copy_workqueue_attrs(new_attrs, attrs);
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, wq_unbound_cpumask);
if (unlikely(cpumask_empty(new_attrs->cpumask)))
cpumask_copy(new_attrs->cpumask, wq_unbound_cpumask);
...
copy_workqueue_attrs(tmp_attrs, new_attrs);
...
for_each_node(node) {
if (wq_calc_node_cpumask(new_attrs, node, -1, 
tmp_attrs->cpumask)) {
+   BUG_ON(cpumask_empty(tmp_attrs->cpumask));
ctx->pwq_tbl[node] = alloc_unbound_pwq(wq, tmp_attrs);


The bad case (where we hit the BUG_ON I added above) is where we are
creating a wq for node 1.

In wq_calc_node_cpumask() we do:

cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
return !cpumask_equal(cpumask, attrs->cpumask);

Which with the arguments inserted is:

cpumask_and(tmp_attrs->cpumask, new_attrs->cpumask, 
wq_numa_possible_cpumask[1]);
return !cpumask_equal(tmp_attrs->cpumask, new_attrs->cpumask);

And that results in tmp_attrs->cpumask being empty, because
wq_numa_possible_cpumask[1] is an empty cpumask.

The reason wq_numa_possible_cpumask[1] is an empty mask is because in
wq_numa_init() we did:

for_each_possible_cpu(cpu) {
node = cpu_to_node(cpu);
if (WARN_ON(node == NUMA_NO_NODE)) {
pr_warn("workqueue: NUMA node mapping not available for 
cpu%d, disabling NUMA support\n", cpu);
/* happens iff arch is bonkers, let's just proceed */
return;
}
cpumask_set_cpu(cpu, tbl[node]);
}

And cpu_to_node() returned node 0 for every CPU in the system, despite there
being multiple nodes.

That happened because we haven't yet called set_cpu_numa_node() for the non-boot
cpus, because that happens in smp_prepare_cpus(), and
workqueue_init_early() is called much earlier than that.

This doesn't trigger on x86 because it does set_cpu_numa_node() in
setup_per_cpu_areas(), which is called prior to workqueue_init_early().

We can (should) probably do the same on powerpc, I'll look at that
tomorrow. But other arches may have a similar problem, and at the very
least we need to document that workqueue_init_early() relies on
cpu_to_node() working.

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-15 Thread Michael Ellerman
Tejun Heo  writes:

> Hello, Michael.
>
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>> 
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>> 
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
>
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.

Try turning on CONFIG_DEBUG_PER_CPU_MAPS=y ?

That will warn if you're indexing off the end of a cpu mask and just
getting lucky with the result.

>> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
>> empty, but I haven't had time to track down why that's happening.
>
> Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
> select_task_rq() and post what that says?

It says:

[ cut here ]
WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:1602 try_to_wake_up+0x3f4/0x5c0
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
4.8.0-compiler_gcc-6.2.0-next-20161014-dirty #116
task: c00ff920 task.stack: c01ffc084000
NIP: c00f1ba4 LR: c00f180c CTR: 
REGS: c01ffc0878f0 TRAP: 0700   Not tainted  
(4.8.0-compiler_gcc-6.2.0-next-20161014-dirty)
MSR: 92029033  CR: 28000422  XER: 
CFAR: c00f18bc SOFTE: 0 
GPR00: c00f180c c01ffc087b70 c0e83400  
GPR04: 0002    
GPR08: c0dc3400 0001 0002  
GPR12:  cfb8 c000e0c8  
GPR16:     
GPR20:    c0eb8960 
GPR24:  c0d8ce00   
GPR28: c007f54050f4   c007f5404900 
NIP [c00f1ba4] try_to_wake_up+0x3f4/0x5c0
LR [c00f180c] try_to_wake_up+0x5c/0x5c0
Call Trace:
[c01ffc087b70] [c00f180c] try_to_wake_up+0x5c/0x5c0 (unreliable)
[c01ffc087bf0] [c00d53e4] create_worker+0x144/0x250
[c01ffc087c90] [c0cf7930] workqueue_init+0x170/0x19c
[c01ffc087d00] [c0ce0e74] kernel_init_freeable+0x158/0x360
[c01ffc087dc0] [c000e0e4] kernel_init+0x24/0x160
[c01ffc087e30] [c000bfa0] ret_from_kernel_thread+0x5c/0xbc
Instruction dump:
e8790890 4bff6ed9 2fa3 419e00dc 6000 4bfffe54 3d02fff4 8928d7f9 
2f89 409e0018 3921 9928d7f9 <0fe0> 6000 6042 3b5f0368 
---[ end trace  ]---


But I'm not sure that tells us anything new?

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-15 Thread Michael Ellerman
Tejun Heo  writes:

> Hello, Michael.
>
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>> 
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>> 
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
>
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.

Try turning on CONFIG_DEBUG_PER_CPU_MAPS=y ?

That will warn if you're indexing off the end of a cpu mask and just
getting lucky with the result.

>> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
>> empty, but I haven't had time to track down why that's happening.
>
> Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
> select_task_rq() and post what that says?

It says:

[ cut here ]
WARNING: CPU: 0 PID: 1 at ../kernel/sched/core.c:1602 try_to_wake_up+0x3f4/0x5c0
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
4.8.0-compiler_gcc-6.2.0-next-20161014-dirty #116
task: c00ff920 task.stack: c01ffc084000
NIP: c00f1ba4 LR: c00f180c CTR: 
REGS: c01ffc0878f0 TRAP: 0700   Not tainted  
(4.8.0-compiler_gcc-6.2.0-next-20161014-dirty)
MSR: 92029033  CR: 28000422  XER: 
CFAR: c00f18bc SOFTE: 0 
GPR00: c00f180c c01ffc087b70 c0e83400  
GPR04: 0002    
GPR08: c0dc3400 0001 0002  
GPR12:  cfb8 c000e0c8  
GPR16:     
GPR20:    c0eb8960 
GPR24:  c0d8ce00   
GPR28: c007f54050f4   c007f5404900 
NIP [c00f1ba4] try_to_wake_up+0x3f4/0x5c0
LR [c00f180c] try_to_wake_up+0x5c/0x5c0
Call Trace:
[c01ffc087b70] [c00f180c] try_to_wake_up+0x5c/0x5c0 (unreliable)
[c01ffc087bf0] [c00d53e4] create_worker+0x144/0x250
[c01ffc087c90] [c0cf7930] workqueue_init+0x170/0x19c
[c01ffc087d00] [c0ce0e74] kernel_init_freeable+0x158/0x360
[c01ffc087dc0] [c000e0e4] kernel_init+0x24/0x160
[c01ffc087e30] [c000bfa0] ret_from_kernel_thread+0x5c/0xbc
Instruction dump:
e8790890 4bff6ed9 2fa3 419e00dc 6000 4bfffe54 3d02fff4 8928d7f9 
2f89 409e0018 3921 9928d7f9 <0fe0> 6000 6042 3b5f0368 
---[ end trace  ]---


But I'm not sure that tells us anything new?

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-14 Thread Balbir Singh


On 15/10/16 02:08, Tejun Heo wrote:
> Hello, Balbir.
> 
> On Tue, Oct 11, 2016 at 11:21:09PM +1100, Balbir Singh wrote:
>> FYI: I see the samething on my cpu as well, its just that I get lucky
>> and cpu_online(cpu) returns false.
> 
> Are you seeing this on x86 or is your test setup also a power machine?
> 
> Thanks.
> 

I saw this on a powerpc box

Balbir Singh.


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-14 Thread Balbir Singh


On 15/10/16 02:08, Tejun Heo wrote:
> Hello, Balbir.
> 
> On Tue, Oct 11, 2016 at 11:21:09PM +1100, Balbir Singh wrote:
>> FYI: I see the samething on my cpu as well, its just that I get lucky
>> and cpu_online(cpu) returns false.
> 
> Are you seeing this on x86 or is your test setup also a power machine?
> 
> Thanks.
> 

I saw this on a powerpc box

Balbir Singh.


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-14 Thread Balbir Singh


On 15/10/16 02:07, Tejun Heo wrote:
> Hello, Michael.
> 
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>>
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>>
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
> 
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.
> 
>> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
>> empty, but I haven't had time to track down why that's happening.
> 

I think the basic analysis shows the change to creation of unbounded
workqueues from the unbound_hash, but those have a pool cpumask empty.

> Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
> select_task_rq() and post what that says?
> 
> Thanks.
> 

Balbir Singh.


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-14 Thread Balbir Singh


On 15/10/16 02:07, Tejun Heo wrote:
> Hello, Michael.
> 
> On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
>> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
>> is NULL.
>>
>> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
>> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
>> set_task_rq() and happen to get NULL.
>>
>> We never should have done set_task_rq(p, 2048), because 2048 is >=
>> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
>> doesn't cope with that.
> 
> Hmm... it doesn't reproduce it here and can't see how the commit would
> affect this given that it doesn't really change when the kworker
> kthreads are being created.
> 
>> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
>> empty, but I haven't had time to track down why that's happening.
> 

I think the basic analysis shows the change to creation of unbounded
workqueues from the unbound_hash, but those have a pool cpumask empty.

> Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
> select_task_rq() and post what that says?
> 
> Thanks.
> 

Balbir Singh.


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-14 Thread Tejun Heo
Hello, Balbir.

On Tue, Oct 11, 2016 at 11:21:09PM +1100, Balbir Singh wrote:
> FYI: I see the samething on my cpu as well, its just that I get lucky
> and cpu_online(cpu) returns false.

Are you seeing this on x86 or is your test setup also a power machine?

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-14 Thread Tejun Heo
Hello, Balbir.

On Tue, Oct 11, 2016 at 11:21:09PM +1100, Balbir Singh wrote:
> FYI: I see the samething on my cpu as well, its just that I get lucky
> and cpu_online(cpu) returns false.

Are you seeing this on x86 or is your test setup also a power machine?

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-14 Thread Tejun Heo
Hello, Michael.

On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
> is NULL.
> 
> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
> set_task_rq() and happen to get NULL.
> 
> We never should have done set_task_rq(p, 2048), because 2048 is >=
> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
> doesn't cope with that.

Hmm... it doesn't reproduce it here and can't see how the commit would
affect this given that it doesn't really change when the kworker
kthreads are being created.

> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
> empty, but I haven't had time to track down why that's happening.

Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
select_task_rq() and post what that says?

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-14 Thread Tejun Heo
Hello, Michael.

On Tue, Oct 11, 2016 at 10:22:13PM +1100, Michael Ellerman wrote:
> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
> is NULL.
> 
> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
> set_task_rq() and happen to get NULL.
> 
> We never should have done set_task_rq(p, 2048), because 2048 is >=
> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
> doesn't cope with that.

Hmm... it doesn't reproduce it here and can't see how the commit would
affect this given that it doesn't really change when the kworker
kthreads are being created.

> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
> empty, but I haven't had time to track down why that's happening.

Can you please add WARN_ON_ONCE(!tsk_nr_cpus_allowed(p)) to
select_task_rq() and post what that says?

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-11 Thread Balbir Singh


On 11/10/16 22:22, Michael Ellerman wrote:
> Tejun Heo  writes:
> 
>> Hello, Michael.
>>
>> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
>>> This patch seems to be causing one of my Power8 boxes not to boot.
>>>
>>> Specifically commit 3347fa092821 ("workqueue: make workqueue available
>>> early during boot") in linux-next.
>>>
>>> If I revert this on top of next-20161005 then the machine boots again.
>>>
>>> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>>
>> Hah, weird that it's arch dependent, or maybe it's just different
>> config options.  Most likely, it's caused by workqueue_init() call
>> being moved too early.  Can you please try the following patch and see
>> whether the problem goes away?
> 
> No that doesn't help.
> 
> What does is this:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1ab00a..4e79549d242f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1614,7 +1614,8 @@ int select_task_rq(struct task_struct *p, int cpu, int 
> sd_flags, int wake_flags)
>* [ this allows ->select_task() to simply return task_cpu(p) and
>*   not worry about this generic constraint ]
>*/
> - if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
> + if (unlikely(cpu >= nr_cpu_ids ||
> +  !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
>!cpu_online(cpu)))
>   cpu = select_fallback_rq(task_cpu(p), p);
>  
> 
> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
> is NULL.
> 
> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
> set_task_rq() and happen to get NULL.
> 
> We never should have done set_task_rq(p, 2048), because 2048 is >=
> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
> doesn't cope with that.
> 
> The reason we're calling set_task_rq() with CPU 2048 is because
> in select_task_rq() we had tsk_nr_cpus_allowed() = 0, because
> tsk_cpus_allowed(p) is an empty cpu mask.
> 
> That means we do in select_task_rq():
>   cpu = cpumask_any(tsk_cpus_allowed(p)); 
>   
>  
> 
> And when tsk_cpus_allowed(p) is empty cpumask_any() returns nr_cpu_ids,
> causing cpu to be set to 2048 in my case.
> 
> select_task_rq() then does the check to see if it should use a fallback
> rq:
> 
> if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||   
>   
>
>!cpu_online(cpu)))
>   cpu = select_fallback_rq(task_cpu(p), p);
> 
> 
> But in both those checks we end up indexing off the end of the cpu mask,
> because cpu is >= nr_cpu_ids. At least on my system they both return
> true and so we return cpu == 2048.
> 
> The patch above is pretty clearly not the right fix, though maybe it's a
> good safety measure.
> 
> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
> empty, but I haven't had time to track down why that's happening.
> 
> cheers
> 

+peterz

FYI: I see the samething on my cpu as well, its just that I get lucky
and cpu_online(cpu) returns false.

I think from a functional perspective we may want to get some additional
debug checks for places where the cpumask is empty early during boot.

Looks like there is a dependency between cpumasks and cpus coming online.
I wonder if we can hit similar issues during hotplug

FWIW, your patch looks correct to me, though one might argue that
cpumask_test_cpu() is a better place to fix it

Balbir Singh


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-11 Thread Balbir Singh


On 11/10/16 22:22, Michael Ellerman wrote:
> Tejun Heo  writes:
> 
>> Hello, Michael.
>>
>> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
>>> This patch seems to be causing one of my Power8 boxes not to boot.
>>>
>>> Specifically commit 3347fa092821 ("workqueue: make workqueue available
>>> early during boot") in linux-next.
>>>
>>> If I revert this on top of next-20161005 then the machine boots again.
>>>
>>> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>>
>> Hah, weird that it's arch dependent, or maybe it's just different
>> config options.  Most likely, it's caused by workqueue_init() call
>> being moved too early.  Can you please try the following patch and see
>> whether the problem goes away?
> 
> No that doesn't help.
> 
> What does is this:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 94732d1ab00a..4e79549d242f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1614,7 +1614,8 @@ int select_task_rq(struct task_struct *p, int cpu, int 
> sd_flags, int wake_flags)
>* [ this allows ->select_task() to simply return task_cpu(p) and
>*   not worry about this generic constraint ]
>*/
> - if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
> + if (unlikely(cpu >= nr_cpu_ids ||
> +  !cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
>!cpu_online(cpu)))
>   cpu = select_fallback_rq(task_cpu(p), p);
>  
> 
> The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
> is NULL.
> 
> The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
> NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
> set_task_rq() and happen to get NULL.
> 
> We never should have done set_task_rq(p, 2048), because 2048 is >=
> nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
> doesn't cope with that.
> 
> The reason we're calling set_task_rq() with CPU 2048 is because
> in select_task_rq() we had tsk_nr_cpus_allowed() = 0, because
> tsk_cpus_allowed(p) is an empty cpu mask.
> 
> That means we do in select_task_rq():
>   cpu = cpumask_any(tsk_cpus_allowed(p)); 
>   
>  
> 
> And when tsk_cpus_allowed(p) is empty cpumask_any() returns nr_cpu_ids,
> causing cpu to be set to 2048 in my case.
> 
> select_task_rq() then does the check to see if it should use a fallback
> rq:
> 
> if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||   
>   
>
>!cpu_online(cpu)))
>   cpu = select_fallback_rq(task_cpu(p), p);
> 
> 
> But in both those checks we end up indexing off the end of the cpu mask,
> because cpu is >= nr_cpu_ids. At least on my system they both return
> true and so we return cpu == 2048.
> 
> The patch above is pretty clearly not the right fix, though maybe it's a
> good safety measure.
> 
> Presumably we shouldn't be ending up with tsk_cpus_allowed() being
> empty, but I haven't had time to track down why that's happening.
> 
> cheers
> 

+peterz

FYI: I see the samething on my cpu as well, its just that I get lucky
and cpu_online(cpu) returns false.

I think from a functional perspective we may want to get some additional
debug checks for places where the cpumask is empty early during boot.

Looks like there is a dependency between cpumasks and cpus coming online.
I wonder if we can hit similar issues during hotplug

FWIW, your patch looks correct to me, though one might argue that
cpumask_test_cpu() is a better place to fix it

Balbir Singh


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-11 Thread Michael Ellerman
Tejun Heo  writes:

> Hello, Michael.
>
> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
>> This patch seems to be causing one of my Power8 boxes not to boot.
>> 
>> Specifically commit 3347fa092821 ("workqueue: make workqueue available
>> early during boot") in linux-next.
>> 
>> If I revert this on top of next-20161005 then the machine boots again.
>> 
>> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>
> Hah, weird that it's arch dependent, or maybe it's just different
> config options.  Most likely, it's caused by workqueue_init() call
> being moved too early.  Can you please try the following patch and see
> whether the problem goes away?

No that doesn't help.

What does is this:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1ab00a..4e79549d242f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1614,7 +1614,8 @@ int select_task_rq(struct task_struct *p, int cpu, int 
sd_flags, int wake_flags)
 * [ this allows ->select_task() to simply return task_cpu(p) and
 *   not worry about this generic constraint ]
 */
-   if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
+   if (unlikely(cpu >= nr_cpu_ids ||
+!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
 !cpu_online(cpu)))
cpu = select_fallback_rq(task_cpu(p), p);
 

The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
is NULL.

The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
set_task_rq() and happen to get NULL.

We never should have done set_task_rq(p, 2048), because 2048 is >=
nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
doesn't cope with that.

The reason we're calling set_task_rq() with CPU 2048 is because
in select_task_rq() we had tsk_nr_cpus_allowed() = 0, because
tsk_cpus_allowed(p) is an empty cpu mask.

That means we do in select_task_rq():
  cpu = cpumask_any(tsk_cpus_allowed(p));   

 

And when tsk_cpus_allowed(p) is empty cpumask_any() returns nr_cpu_ids,
causing cpu to be set to 2048 in my case.

select_task_rq() then does the check to see if it should use a fallback
rq:

if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) || 

   
 !cpu_online(cpu)))
cpu = select_fallback_rq(task_cpu(p), p);


But in both those checks we end up indexing off the end of the cpu mask,
because cpu is >= nr_cpu_ids. At least on my system they both return
true and so we return cpu == 2048.

The patch above is pretty clearly not the right fix, though maybe it's a
good safety measure.

Presumably we shouldn't be ending up with tsk_cpus_allowed() being
empty, but I haven't had time to track down why that's happening.

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-11 Thread Michael Ellerman
Tejun Heo  writes:

> Hello, Michael.
>
> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
>> This patch seems to be causing one of my Power8 boxes not to boot.
>> 
>> Specifically commit 3347fa092821 ("workqueue: make workqueue available
>> early during boot") in linux-next.
>> 
>> If I revert this on top of next-20161005 then the machine boots again.
>> 
>> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
>
> Hah, weird that it's arch dependent, or maybe it's just different
> config options.  Most likely, it's caused by workqueue_init() call
> being moved too early.  Can you please try the following patch and see
> whether the problem goes away?

No that doesn't help.

What does is this:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 94732d1ab00a..4e79549d242f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1614,7 +1614,8 @@ int select_task_rq(struct task_struct *p, int cpu, int 
sd_flags, int wake_flags)
 * [ this allows ->select_task() to simply return task_cpu(p) and
 *   not worry about this generic constraint ]
 */
-   if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
+   if (unlikely(cpu >= nr_cpu_ids ||
+!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) ||
 !cpu_online(cpu)))
cpu = select_fallback_rq(task_cpu(p), p);
 

The oops happens because we're in enqueue_task_fair() and p->se->cfs_rq
is NULL.

The cfs_rq is NULL because we did set_task_rq(p, 2048), where 2048 is
NR_CPUS. That causes us to index past the end of the tg->cfs_rq array in
set_task_rq() and happen to get NULL.

We never should have done set_task_rq(p, 2048), because 2048 is >=
nr_cpu_ids, which means it's not a valid CPU number, and set_task_rq()
doesn't cope with that.

The reason we're calling set_task_rq() with CPU 2048 is because
in select_task_rq() we had tsk_nr_cpus_allowed() = 0, because
tsk_cpus_allowed(p) is an empty cpu mask.

That means we do in select_task_rq():
  cpu = cpumask_any(tsk_cpus_allowed(p));   

 

And when tsk_cpus_allowed(p) is empty cpumask_any() returns nr_cpu_ids,
causing cpu to be set to 2048 in my case.

select_task_rq() then does the check to see if it should use a fallback
rq:

if (unlikely(!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) || 

   
 !cpu_online(cpu)))
cpu = select_fallback_rq(task_cpu(p), p);


But in both those checks we end up indexing off the end of the cpu mask,
because cpu is >= nr_cpu_ids. At least on my system they both return
true and so we return cpu == 2048.

The patch above is pretty clearly not the right fix, though maybe it's a
good safety measure.

Presumably we shouldn't be ending up with tsk_cpus_allowed() being
empty, but I haven't had time to track down why that's happening.

cheers


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Balbir Singh


On 10/10/16 23:53, Tejun Heo wrote:
> On Mon, Oct 10, 2016 at 10:17:16PM +1100, Balbir Singh wrote:
>> rest_init()
>> {
>>  ...
>>  kernel_thread(kernel_init, NULL, CLONE_FS);
>>  numa_default_policy();
>>  pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
>>  rcu_read_lock();
>>  kthreadd_task = find_task_by_pid_ns(pid, _pid_ns);
>>  ...
>>
>> }
>>
>> create_worker() needs kthreadd, it wakes up kthreadd in 
>> kthread_create_on_node,
>> workqueue_init() is called from kernel_init() , but kthreadd is created after
>> the call to kernel_init(), so its touch and go
> 
> But the first thing kernel_init_freeable() does is
> wait_for_completion(_done).
> 

Yes, Of course, looking at the stack trace again, it was not the wake_up itself,
but the absence of cfs_rq of p->se that caused the issue. Will try and chase it
down. Quick look shows cgroup_init() has occurred before workqueue_init(), so
ideally p->se.cfs_rq should be allocated.

Sorry for the noise,

Balbir


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Balbir Singh


On 10/10/16 23:53, Tejun Heo wrote:
> On Mon, Oct 10, 2016 at 10:17:16PM +1100, Balbir Singh wrote:
>> rest_init()
>> {
>>  ...
>>  kernel_thread(kernel_init, NULL, CLONE_FS);
>>  numa_default_policy();
>>  pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
>>  rcu_read_lock();
>>  kthreadd_task = find_task_by_pid_ns(pid, _pid_ns);
>>  ...
>>
>> }
>>
>> create_worker() needs kthreadd, it wakes up kthreadd in 
>> kthread_create_on_node,
>> workqueue_init() is called from kernel_init() , but kthreadd is created after
>> the call to kernel_init(), so its touch and go
> 
> But the first thing kernel_init_freeable() does is
> wait_for_completion(_done).
> 

Yes, Of course, looking at the stack trace again, it was not the wake_up itself,
but the absence of cfs_rq of p->se that caused the issue. Will try and chase it
down. Quick look shows cgroup_init() has occurred before workqueue_init(), so
ideally p->se.cfs_rq should be allocated.

Sorry for the noise,

Balbir


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Tejun Heo
On Mon, Oct 10, 2016 at 09:02:53AM -0400, Tejun Heo wrote:
> Hello, Michael.
> 
> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
> > This patch seems to be causing one of my Power8 boxes not to boot.
> > 
> > Specifically commit 3347fa092821 ("workqueue: make workqueue available
> > early during boot") in linux-next.
> > 
> > If I revert this on top of next-20161005 then the machine boots again.
> > 
> > I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
> 
> Hah, weird that it's arch dependent, or maybe it's just different
> config options.  Most likely, it's caused by workqueue_init() call
> being moved too early.  Can you please try the following patch and see
> whether the problem goes away?

And if it does, can you please post the boot log with kernel boot
option "initcall_debug=1" specified?

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Tejun Heo
On Mon, Oct 10, 2016 at 09:02:53AM -0400, Tejun Heo wrote:
> Hello, Michael.
> 
> On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
> > This patch seems to be causing one of my Power8 boxes not to boot.
> > 
> > Specifically commit 3347fa092821 ("workqueue: make workqueue available
> > early during boot") in linux-next.
> > 
> > If I revert this on top of next-20161005 then the machine boots again.
> > 
> > I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
> 
> Hah, weird that it's arch dependent, or maybe it's just different
> config options.  Most likely, it's caused by workqueue_init() call
> being moved too early.  Can you please try the following patch and see
> whether the problem goes away?

And if it does, can you please post the boot log with kernel boot
option "initcall_debug=1" specified?

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Tejun Heo
Hello, Michael.

On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
> This patch seems to be causing one of my Power8 boxes not to boot.
> 
> Specifically commit 3347fa092821 ("workqueue: make workqueue available
> early during boot") in linux-next.
> 
> If I revert this on top of next-20161005 then the machine boots again.
> 
> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?

Hah, weird that it's arch dependent, or maybe it's just different
config options.  Most likely, it's caused by workqueue_init() call
being moved too early.  Can you please try the following patch and see
whether the problem goes away?

Thanks.

diff --git a/init/main.c b/init/main.c
index 5c4fd68..fe4fa47 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1013,7 +1013,7 @@ static noinline void __init kernel_init_freeable(void)
 
smp_prepare_cpus(setup_max_cpus);
 
-   workqueue_init();
+   //workqueue_init();
 
do_pre_smp_initcalls();
lockup_detector_init();
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ad0cd43..400f5e2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5585,3 +5585,4 @@ int __init workqueue_init(void)
 
return 0;
 }
+early_initcall(workqueue_init);



Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Tejun Heo
Hello, Michael.

On Mon, Oct 10, 2016 at 09:22:55PM +1100, Michael Ellerman wrote:
> This patch seems to be causing one of my Power8 boxes not to boot.
> 
> Specifically commit 3347fa092821 ("workqueue: make workqueue available
> early during boot") in linux-next.
> 
> If I revert this on top of next-20161005 then the machine boots again.
> 
> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?

Hah, weird that it's arch dependent, or maybe it's just different
config options.  Most likely, it's caused by workqueue_init() call
being moved too early.  Can you please try the following patch and see
whether the problem goes away?

Thanks.

diff --git a/init/main.c b/init/main.c
index 5c4fd68..fe4fa47 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1013,7 +1013,7 @@ static noinline void __init kernel_init_freeable(void)
 
smp_prepare_cpus(setup_max_cpus);
 
-   workqueue_init();
+   //workqueue_init();
 
do_pre_smp_initcalls();
lockup_detector_init();
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ad0cd43..400f5e2 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5585,3 +5585,4 @@ int __init workqueue_init(void)
 
return 0;
 }
+early_initcall(workqueue_init);



Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Tejun Heo
On Mon, Oct 10, 2016 at 10:17:16PM +1100, Balbir Singh wrote:
> rest_init()
> {
>   ...
>   kernel_thread(kernel_init, NULL, CLONE_FS);
>   numa_default_policy();
>   pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
>   rcu_read_lock();
>   kthreadd_task = find_task_by_pid_ns(pid, _pid_ns);
>   ...
> 
> }
> 
> create_worker() needs kthreadd, it wakes up kthreadd in 
> kthread_create_on_node,
> workqueue_init() is called from kernel_init() , but kthreadd is created after
> the call to kernel_init(), so its touch and go

But the first thing kernel_init_freeable() does is
wait_for_completion(_done).

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Tejun Heo
On Mon, Oct 10, 2016 at 10:17:16PM +1100, Balbir Singh wrote:
> rest_init()
> {
>   ...
>   kernel_thread(kernel_init, NULL, CLONE_FS);
>   numa_default_policy();
>   pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
>   rcu_read_lock();
>   kthreadd_task = find_task_by_pid_ns(pid, _pid_ns);
>   ...
> 
> }
> 
> create_worker() needs kthreadd, it wakes up kthreadd in 
> kthread_create_on_node,
> workqueue_init() is called from kernel_init() , but kthreadd is created after
> the call to kernel_init(), so its touch and go

But the first thing kernel_init_freeable() does is
wait_for_completion(_done).

Thanks.

-- 
tejun


Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Balbir Singh


On 10/10/16 21:22, Michael Ellerman wrote:
> Hi Tejun,
> 
> Tejun Heo  writes:
>> From f85002f627f7fdc7b3cda526863f5c9a8d36b997 Mon Sep 17 00:00:00 2001
>> From: Tejun Heo 
>> Date: Fri, 16 Sep 2016 15:49:32 -0400
>> Subject: [PATCH] workqueue: make workqueue available early during boot
>>
>> Workqueue is currently initialized in an early init call; however,
>> there are cases where early boot code has to be split and reordered to
>> come after workqueue initialization or the same code path which makes
>> use of workqueues is used both before workqueue initailization and
>> after.  The latter cases have to gate workqueue usages with
>> keventd_up() tests, which is nasty and easy to get wrong.
>>
>> Workqueue usages have become widespread and it'd be a lot more
>> convenient if it can be used very early from boot.  This patch splits
>> workqueue initialization into two steps.  workqueue_init_early() which
>> sets up the basic data structures so that workqueues can be created
>> and work items queued, and workqueue_init() which actually brings up
>> workqueues online and starts executing queued work items.  The former
>> step can be done very early during boot once memory allocation,
>> cpumasks and idr are initialized.  The latter right after kthreads
>> become available.
>>
>> This allows work item queueing and canceling from very early boot
>> which is what most of these use cases want.
>>
>> * As systemd_wq being initialized doesn't indicate that workqueue is
>>   fully online anymore, update keventd_up() to test wq_online instead.
>>   The follow-up patches will get rid of all its usages and the
>>   function itself.
>>
>> * Flushing doesn't make sense before workqueue is fully initialized.
>>   The flush functions trigger WARN and return immediately before fully
>>   online.
>>
>> * Work items are never in-flight before fully online.  Canceling can
>>   always succeed by skipping the flush step.
>>
>> * Some code paths can no longer assume to be called with irq enabled
>>   as irq is disabled during early boot.  Use irqsave/restore
>>   operations instead.
>>
>> v2: Watchdog init, which requires timer to be running, moved from
>> workqueue_init_early() to workqueue_init().
>>
>> Signed-off-by: Tejun Heo 
>> Suggested-by: Linus Torvalds 
>> Link: 
>> http://lkml.kernel.org/r/ca+55afx0vpumuxn00rbsm192n-du5uxy+4avka0sbsovjeu...@mail.gmail.com
> 
> 
> This patch seems to be causing one of my Power8 boxes not to boot.
> 
> Specifically commit 3347fa092821 ("workqueue: make workqueue available
> early during boot") in linux-next.
> 
> If I revert this on top of next-20161005 then the machine boots again.
> 
> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
> 
> cheers
> 
> 
> bootconsole [udbg0] disabled
> bootconsole [udbg0] disabled
> mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= 
> or the kernel.numa_balancing sysctl
> pid_max: default: 163840 minimum: 1280
> Dentry cache hash table entries: 16777216 (order: 11, 134217728 bytes)
> Inode-cache hash table entries: 8388608 (order: 10, 67108864 bytes)
> Mount-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Mountpoint-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Unable to handle kernel paging request for data at address 0x0038
> Faulting instruction address: 0xc00fc0cc
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
> task: c007f540 task.stack: c01ffc084000
> NIP: c00fc0cc LR: c00ed928 CTR: c00fbfd0
> REGS: c01ffc087780 TRAP: 0300   Not tainted  
> (4.8.0-compiler_gcc-6.2.0-next-20161005)
> MSR: 92009033   CR: 48000424  XER: 
> 
> CFAR: c00089dc DAR: 0038 DSISR: 4000 SOFTE: 0 
> GPR00: c00ed928 c01ffc087a00 c0e63200 c00010d6d600 
> GPR04: c007f5409200 0021 0748e08c 001f 
> GPR08:  0021 0748f1f8  
> GPR12: 28000422 cfb8 c000e0c8  
> GPR16:   0021 0001 
> GPR20: afb50401  c00010d6d600 ba7e 
> GPR24: ba7e c0d8bc58 afb504000afb5041 0001 
> GPR28:  0004 c007f5409280  
> NIP [c00fc0cc] enqueue_task_fair+0xfc/0x18b0
> LR [c00ed928] activate_task+0x78/0xe0
> Call Trace:
> [c01ffc087a00] [c007f5409200] 0xc007f5409200 (unreliable)
> [c01ffc087b10] [c00ed928] activate_task+0x78/0xe0
> [c01ffc087b50] [c00ede58] ttwu_do_activate+0x68/0xc0
> [c01ffc087b90] 

Re: Oops on Power8 (was Re: [PATCH v2 1/7] workqueue: make workqueue available early during boot)

2016-10-10 Thread Balbir Singh


On 10/10/16 21:22, Michael Ellerman wrote:
> Hi Tejun,
> 
> Tejun Heo  writes:
>> From f85002f627f7fdc7b3cda526863f5c9a8d36b997 Mon Sep 17 00:00:00 2001
>> From: Tejun Heo 
>> Date: Fri, 16 Sep 2016 15:49:32 -0400
>> Subject: [PATCH] workqueue: make workqueue available early during boot
>>
>> Workqueue is currently initialized in an early init call; however,
>> there are cases where early boot code has to be split and reordered to
>> come after workqueue initialization or the same code path which makes
>> use of workqueues is used both before workqueue initailization and
>> after.  The latter cases have to gate workqueue usages with
>> keventd_up() tests, which is nasty and easy to get wrong.
>>
>> Workqueue usages have become widespread and it'd be a lot more
>> convenient if it can be used very early from boot.  This patch splits
>> workqueue initialization into two steps.  workqueue_init_early() which
>> sets up the basic data structures so that workqueues can be created
>> and work items queued, and workqueue_init() which actually brings up
>> workqueues online and starts executing queued work items.  The former
>> step can be done very early during boot once memory allocation,
>> cpumasks and idr are initialized.  The latter right after kthreads
>> become available.
>>
>> This allows work item queueing and canceling from very early boot
>> which is what most of these use cases want.
>>
>> * As systemd_wq being initialized doesn't indicate that workqueue is
>>   fully online anymore, update keventd_up() to test wq_online instead.
>>   The follow-up patches will get rid of all its usages and the
>>   function itself.
>>
>> * Flushing doesn't make sense before workqueue is fully initialized.
>>   The flush functions trigger WARN and return immediately before fully
>>   online.
>>
>> * Work items are never in-flight before fully online.  Canceling can
>>   always succeed by skipping the flush step.
>>
>> * Some code paths can no longer assume to be called with irq enabled
>>   as irq is disabled during early boot.  Use irqsave/restore
>>   operations instead.
>>
>> v2: Watchdog init, which requires timer to be running, moved from
>> workqueue_init_early() to workqueue_init().
>>
>> Signed-off-by: Tejun Heo 
>> Suggested-by: Linus Torvalds 
>> Link: 
>> http://lkml.kernel.org/r/ca+55afx0vpumuxn00rbsm192n-du5uxy+4avka0sbsovjeu...@mail.gmail.com
> 
> 
> This patch seems to be causing one of my Power8 boxes not to boot.
> 
> Specifically commit 3347fa092821 ("workqueue: make workqueue available
> early during boot") in linux-next.
> 
> If I revert this on top of next-20161005 then the machine boots again.
> 
> I've attached the oops below. It looks like the cfs_rq of p->se is NULL?
> 
> cheers
> 
> 
> bootconsole [udbg0] disabled
> bootconsole [udbg0] disabled
> mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= 
> or the kernel.numa_balancing sysctl
> pid_max: default: 163840 minimum: 1280
> Dentry cache hash table entries: 16777216 (order: 11, 134217728 bytes)
> Inode-cache hash table entries: 8388608 (order: 10, 67108864 bytes)
> Mount-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Mountpoint-cache hash table entries: 262144 (order: 5, 2097152 bytes)
> Unable to handle kernel paging request for data at address 0x0038
> Faulting instruction address: 0xc00fc0cc
> Oops: Kernel access of bad area, sig: 11 [#1]
> SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 4.8.0-compiler_gcc-6.2.0-next-20161005 #94
> task: c007f540 task.stack: c01ffc084000
> NIP: c00fc0cc LR: c00ed928 CTR: c00fbfd0
> REGS: c01ffc087780 TRAP: 0300   Not tainted  
> (4.8.0-compiler_gcc-6.2.0-next-20161005)
> MSR: 92009033   CR: 48000424  XER: 
> 
> CFAR: c00089dc DAR: 0038 DSISR: 4000 SOFTE: 0 
> GPR00: c00ed928 c01ffc087a00 c0e63200 c00010d6d600 
> GPR04: c007f5409200 0021 0748e08c 001f 
> GPR08:  0021 0748f1f8  
> GPR12: 28000422 cfb8 c000e0c8  
> GPR16:   0021 0001 
> GPR20: afb50401  c00010d6d600 ba7e 
> GPR24: ba7e c0d8bc58 afb504000afb5041 0001 
> GPR28:  0004 c007f5409280  
> NIP [c00fc0cc] enqueue_task_fair+0xfc/0x18b0
> LR [c00ed928] activate_task+0x78/0xe0
> Call Trace:
> [c01ffc087a00] [c007f5409200] 0xc007f5409200 (unreliable)
> [c01ffc087b10] [c00ed928] activate_task+0x78/0xe0
> [c01ffc087b50] [c00ede58] ttwu_do_activate+0x68/0xc0
> [c01ffc087b90] [c00ef1b8] try_to_wake_up+0x208/0x4f0
> [c01ffc087c10] [c00d3484] create_worker+0x144/0x250
>