Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On 11/14/2014 09:38 PM, Sasha Levin wrote:
> On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
>> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
>>> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
>>> [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>>
>> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>>
>> One of those again :/
>
> This actually reproduces pretty easily. The stack trace seems to be different
> but the end result is the same as above. Anything we can do to debug it? I'm
> really not sure how just the owner_cpu can be different here.

I've added saving stack traces on raw spinlock lock/unlock. Had to keep it at
8 entries, but it's enough to get the gist of it. The trace I'm seeing is:

[ 1239.788220] BUG: spinlock recursion on CPU#0, trinity-c767/32076
[ 1239.788270] irq event stamp: 9135954
[ 1239.788455] hardirqs last enabled at (9135953): free_pages_prepare (mm/page_alloc.c:770)
[ 1239.788573] hardirqs last disabled at (9135954): _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:109 kernel/locking/spinlock.c:159)
[ 1239.788687] softirqs last enabled at (9134166): __do_softirq (./arch/x86/include/asm/preempt.h:22 kernel/softirq.c:296)
[ 1239.788796] softirqs last disabled at (9134161): irq_exit (kernel/softirq.c:346 kernel/softirq.c:387)
[ 1239.791152]  lock: 0x8805c3dd9100, .magic: dead4ead, .owner: trinity-c767/32076, .owner_cpu: -1
[ 1239.791152] lock trace:
[ 1239.791152] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[ 1239.791152] do_raw_spin_trylock (kernel/locking/spinlock_debug.c:171)
[ 1239.791152] _raw_spin_lock_irq (include/linux/spinlock_api_smp.h:129 kernel/locking/spinlock.c:167)
[ 1239.791152] __schedule (kernel/sched/core.c:2803)
[ 1239.791152] schedule (kernel/sched/core.c:2870)
[ 1239.791152] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
[ 1239.791152] p9_client_read (net/9p/client.c:1582)
[ 1239.791152] v9fs_fid_readn (fs/9p/vfs_file.c:386)
[ 1239.791152] unlock trace:
[ 1239.791152] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[ 1239.791152] do_raw_spin_unlock (kernel/locking/spinlock_debug.c:120 include/linux/jump_label.h:114 ./arch/x86/include/asm/spinlock.h:149 kernel/locking/spinlock_debug.c:176)
[ 1239.791152] _raw_spin_unlock_irq (include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199)
[ 1239.791152] finish_task_switch (./arch/x86/include/asm/current.h:14 kernel/sched/core.c:2251)
[ 1239.791152] __schedule (kernel/sched/core.c:2358 kernel/sched/core.c:2840)
[ 1239.791152] schedule_preempt_disabled (kernel/sched/core.c:2897)
[ 1239.791152] cpu_startup_entry (kernel/sched/idle.c:252 kernel/sched/idle.c:274)
[ 1239.791152] start_secondary (arch/x86/kernel/smpboot.c:238)
[ 1239.791152] CPU: 0 PID: 32076 Comm: trinity-c767 Not tainted 3.18.0-rc4-next-20141117-sasha-00046-g481e3e8-dirty #1471
[ 1239.791152]  88089a90b000 880867223728
[ 1239.791152]  920d1ec1 0030 8805c3dd9100 880867223768
[ 1239.791152]  815d0617 9eec7730 88089a90bf58 8805c3dd9100
[ 1239.791152] Call Trace:
[ 1239.791152] dump_stack (lib/dump_stack.c:52)
[ 1239.791152] spin_dump (kernel/locking/spinlock_debug.c:81 (discriminator 8))
[ 1239.791152] spin_bug (kernel/locking/spinlock_debug.c:89)
[ 1239.791152] do_raw_spin_lock (kernel/locking/spinlock_debug.c:97 kernel/locking/spinlock_debug.c:152)
[ 1239.791152] ? load_balance (kernel/sched/fair.c:5582 kernel/sched/fair.c:6884)
[ 1239.791152] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:117 kernel/locking/spinlock.c:159)
[ 1239.791152] ? load_balance (kernel/sched/fair.c:5582 kernel/sched/fair.c:6884)
[ 1239.791152] load_balance (kernel/sched/fair.c:5582 kernel/sched/fair.c:6884)
[ 1239.791152] pick_next_task_fair (kernel/sched/fair.c:7154 kernel/sched/fair.c:5093)
[ 1239.791152] ? pick_next_task_fair (kernel/sched/fair.c:7131 kernel/sched/fair.c:5093)
[ 1239.791152] __schedule (kernel/sched/core.c:2716 kernel/sched/core.c:2830)
[ 1239.791152] ? preempt_count_sub (kernel/sched/core.c:2644)
[ 1239.791152] schedule (kernel/sched/core.c:2870)
[ 1239.791152] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
[ 1239.791152] ? __init_waitqueue_head (kernel/sched/wait.c:292)
[ 1239.791152] ? p9_idpool_put (net/9p/util.c:127)
[ 1239.791152] p9_client_read (net/9p/client.c:1582)
[ 1239.791152] ? p9_free_req (net/9p/client.c:410)
[ 1239.791152] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3734)
[ 1239.791152] ? might_fault (mm/memory.c:3734)
[ 1239.791152] v9fs_fid_readn (fs/9p/vfs_file.c:386)
[ 1239.791152] v9fs_file_read (fs/9p/vfs_file.c:447)
[ 1239.791152] do_loop_readv_writev (fs/read_write.c:724)
[ 1239.791152] ? v9fs_fid_readn (fs/9p/vfs_file.c:433)
[ 1239.791152] ? v9fs_fid_readn (fs/9p/vfs_file.c:433)
[ 1239.791152] do_readv_writev (fs/read_write.c:856)
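The debugging aid described above (saving the last lock and unlock call chains alongside the lock, capped at 8 entries) can be modeled in user space with glibc's backtrace(). This is a hypothetical sketch under those assumptions, not Sasha's actual kernel patch; the names `traced_lock`, `traced_lock_acquire`, and `traced_lock_release` are made up for illustration:

```c
#include <execinfo.h>   /* glibc backtrace(), Linux-specific */
#include <pthread.h>

/* Remember the last lock/unlock call chains (up to 8 entries each) so a
 * later "spinlock recursion" splat can print who last locked and
 * unlocked the lock, like the "lock trace:"/"unlock trace:" dump above. */
#define TRACE_DEPTH 8

struct traced_lock {
    pthread_mutex_t mutex;
    void *lock_trace[TRACE_DEPTH];
    int lock_trace_len;
    void *unlock_trace[TRACE_DEPTH];
    int unlock_trace_len;
};

static void traced_lock_acquire(struct traced_lock *l)
{
    pthread_mutex_lock(&l->mutex);
    /* record up to TRACE_DEPTH return addresses at lock time */
    l->lock_trace_len = backtrace(l->lock_trace, TRACE_DEPTH);
}

static void traced_lock_release(struct traced_lock *l)
{
    /* record the unlock site before dropping the lock */
    l->unlock_trace_len = backtrace(l->unlock_trace, TRACE_DEPTH);
    pthread_mutex_unlock(&l->mutex);
}
```

On a splat, something like backtrace_symbols_fd(l->lock_trace, l->lock_trace_len, 2) would print a symbolized "lock trace" akin to the one in the report.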
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
>> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
>> [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>
> One of those again :/

This actually reproduces pretty easily. The stack trace seems to be different
but the end result is the same as above. Anything we can do to debug it? I'm
really not sure how just the owner_cpu can be different here.

Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
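For readers following along, here is a rough user-space model of the CONFIG_DEBUG_SPINLOCK bookkeeping being puzzled over (hypothetical names and simplified logic; the real code lives in kernel/locking/spinlock_debug.c): .owner and .owner_cpu are set on lock and poisoned on unlock, and "spinlock recursion" fires when the task trying to lock already matches .owner. That is exactly why a matching .owner with a mismatched .owner_cpu is so strange.

```c
/* Simplified model of the debug-spinlock fields from the splat above. */
#define SPINLOCK_MAGIC  0xdead4eadU
#define OWNER_NONE      ((struct task *)-1L)
#define OWNER_CPU_NONE  (-1)

struct task { int pid; };

struct debug_spinlock {
    unsigned int magic;       /* .magic: dead4ead   */
    struct task *owner;       /* .owner             */
    int owner_cpu;            /* .owner_cpu         */
};

/* nonzero if taking the lock would be "recursion" for @curr */
static int debug_spin_recursion(const struct debug_spinlock *l,
                                const struct task *curr)
{
    return l->magic == SPINLOCK_MAGIC && l->owner == curr;
}

/* debug bookkeeping done when the lock is acquired */
static void debug_spin_lock(struct debug_spinlock *l,
                            struct task *curr, int cpu)
{
    l->owner = curr;
    l->owner_cpu = cpu;
}

/* debug bookkeeping done when the lock is released */
static void debug_spin_unlock(struct debug_spinlock *l)
{
    l->owner = OWNER_NONE;
    l->owner_cpu = OWNER_CPU_NONE;
}
```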
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Mon, 10/11/2014 at 23:01 +0300, Kirill Tkhai wrote:
> 10.11.2014, 19:45, "Sasha Levin":
> > On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
> >> I mean task_numa_find_cpu(). If garbage is in cpumask_of_node(env->dst_nid)
> >> and cpu is bigger than the mask, the check
> >>
> >> 	cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))
> >>
> >> may be true.
> >>
> >> So, we dereference a wrong rq in task_numa_compare(). It's not an rq at all;
> >> the strange cpu may come from here. It's just an int number in wrong memory.
> >
> > But the odds of the spinlock magic and owner pointer matching up are slim
> > to none in that case. The memory is also likely to be valid since KASAN didn't
> > complain about the access, so I don't believe it to be an access to freed
> > memory.
>
> I'm not good with lockdep checks, so I can't judge right now...
> Just a hypothesis.
>
> >> A hypothesis that the below may help:
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 826fdf3..a2b4a8a 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> >>  {
> >>  	int cpu;
> >>
> >> +	if (!node_online(env->dst_nid))
> >> +		return;
> >
> > I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run it for
> > a bit.
>
> I've looked one more time, and it looks like it's better to check for
> BUG_ON(env->dst_nid > MAX_NUMNODES). node_online() may not work for
> insane nids.
>
> Anyway, even if it's not connected, we need to initialize numa_preferred_nid
> of init_task, because it's inherited by kernel_init() (and /sbin/init too).
> I'll send the patch tomorrow.

Also we can check for cpu. A problem may be in how nodes are built, etc.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..8f5c316 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1381,6 +1381,14 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
 			continue;
 
+		/*
+		 * Is cpumask_of_node() broken? In sched_init() we
+		 * initialize only possible RQs (their rq->lock etc).
+		 */
+		BUG_ON(cpu >= NR_CPUS || !cpu_possible(cpu));
+		/* Insane node id? */
+		BUG_ON(env->dst_nid >= MAX_NUMNODES || !node_online(env->dst_nid));
+
 		env->dst_cpu = cpu;
 		task_numa_compare(env, taskimp, groupimp);
 	}
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
10.11.2014, 19:45, "Sasha Levin":
> On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
>> I mean task_numa_find_cpu(). If garbage is in cpumask_of_node(env->dst_nid)
>> and cpu is bigger than the mask, the check
>>
>> 	cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))
>>
>> may be true.
>>
>> So, we dereference a wrong rq in task_numa_compare(). It's not an rq at all;
>> the strange cpu may come from here. It's just an int number in wrong memory.
>
> But the odds of the spinlock magic and owner pointer matching up are slim
> to none in that case. The memory is also likely to be valid since KASAN didn't
> complain about the access, so I don't believe it to be an access to freed
> memory.

I'm not good with lockdep checks, so I can't judge right now...
Just a hypothesis.

>> A hypothesis that the below may help:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 826fdf3..a2b4a8a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>  {
>>  	int cpu;
>>
>> +	if (!node_online(env->dst_nid))
>> +		return;
>
> I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run it for
> a bit.

I've looked one more time, and it looks like it's better to check for
BUG_ON(env->dst_nid > MAX_NUMNODES). node_online() may not work for
insane nids.

Anyway, even if it's not connected, we need to initialize numa_preferred_nid
of init_task, because it's inherited by kernel_init() (and /sbin/init too).
I'll send the patch tomorrow.

Kirill
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
> I mean task_numa_find_cpu(). If garbage is in cpumask_of_node(env->dst_nid)
> and cpu is bigger than the mask, the check
>
> 	cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))
>
> may be true.
>
> So, we dereference a wrong rq in task_numa_compare(). It's not an rq at all;
> the strange cpu may come from here. It's just an int number in wrong memory.

But the odds of the spinlock magic and owner pointer matching up are slim
to none in that case. The memory is also likely to be valid since KASAN didn't
complain about the access, so I don't believe it to be an access to freed
memory.

> A hypothesis that the below may help:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 826fdf3..a2b4a8a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  {
>  	int cpu;
> 
> +	if (!node_online(env->dst_nid))
> +		return;

I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run it for
a bit.

Thanks,
Sasha
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Mon, 10/11/2014 at 19:10 +0300, Kirill Tkhai wrote:
> On Mon, 10/11/2014 at 17:03 +0100, Peter Zijlstra wrote:
> > On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> > > [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> > > [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> >
> > Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> >
> > One of those again :/
>
> We do not initialize task_struct::numa_preferred_nid for INIT_TASK.
> Is there no problem there?

I mean task_numa_find_cpu(). If garbage is in cpumask_of_node(env->dst_nid)
and cpu is bigger than the mask, the check

	cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))

may be true.

So, we dereference a wrong rq in task_numa_compare(). It's not an rq at all;
the strange cpu may come from here. It's just an int number in wrong memory.

A hypothesis that the below may help:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..a2b4a8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 {
 	int cpu;
 
+	if (!node_online(env->dst_nid))
+		return;
+
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
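Kirill's scenario, an out-of-range cpu index turning a cpumask test into a read of whatever memory happens to follow the mask, can be demonstrated with a minimal user-space model. The helper below is hypothetical (it is not the kernel's bitmap code), but it shares the relevant property: like cpumask_test_cpu(), it does no bounds check on the bit index.

```c
#include <limits.h>
#include <stdbool.h>

/* Model of a fixed-size cpumask. A "cpu" past the end of the mask
 * silently indexes adjacent memory -- the hypothesized failure mode. */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

static bool test_bit_unchecked(unsigned int bit, const unsigned long *words)
{
    /* no check that `bit` fits in the mask, mirroring the hypothesis */
    return words[bit / BITS_PER_WORD] & (1UL << (bit % BITS_PER_WORD));
}
```

With `unsigned long storage[2] = { 0, ~0UL };` and the mask nominally being only `storage[0]`, testing bit `BITS_PER_WORD + 6` "succeeds" even though no bit of the real mask is set; this is how a garbage nid could yield a plausible-looking but bogus cpu number, and hence a pointer into memory that is not an rq at all.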
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Mon, Nov 10, 2014 at 11:09:29AM -0500, Sasha Levin wrote:
> On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> > On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> >> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> >> [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> >
> > Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> >
> > One of those again :/
>
> Hum. I missed that one.
>
> Hold on, but the magic here is fine and the owner pointer is fine, why would
> just the owner cpu be wrong?

I've no clue, but I've seen them before. Complete mystery that.
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Mon, 10/11/2014 at 17:03 +0100, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> > [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> > [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>
> One of those again :/

We do not initialize task_struct::numa_preferred_nid for INIT_TASK.
Is there no problem there?
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
>> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
>> [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>
> One of those again :/

Hum. I missed that one.

Hold on, but the magic here is fine and the owner pointer is fine, why would
just the owner cpu be wrong?

Thanks,
Sasha
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13

Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.

One of those again :/
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Mon, Nov 10, 2014 at 10:48:28AM -0500, Sasha Levin wrote:
> On 11/10/2014 05:03 AM, Peter Zijlstra wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
> >  	raw_spin_unlock_irq(&dst_rq->lock);
> > 
> >  	/*
> > +	 * Because we have preemption enabled we can get migrated around and
> > +	 * end try selecting ourselves (current == env->p) as a swap candidate.
> > +	 */
> > +	if (cur == env->p)
> > +		goto unlock;
>
> This is too late though, because currently the lockup happens a couple of
> lines above that, at:
>
> 	raw_spin_lock_irq(&dst_rq->lock);  <=== here
> 	cur = dst_rq->curr;
>
> Which got me a bit stuck trying to use your old patch, since we can't access
> '->curr' without locking dst_rq, but locking dst_rq is causing a lockup.

Confused... how can we lock up there? We should not be holding _any_ lock
there. That's a different problem than the originally reported one.
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On 11/10/2014 05:03 AM, Peter Zijlstra wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
>  	raw_spin_unlock_irq(&dst_rq->lock);
> 
>  	/*
> +	 * Because we have preemption enabled we can get migrated around and
> +	 * end try selecting ourselves (current == env->p) as a swap candidate.
> +	 */
> +	if (cur == env->p)
> +		goto unlock;

This is too late though, because currently the lockup happens a couple of
lines above that, at:

	raw_spin_lock_irq(&dst_rq->lock);  <=== here
	cur = dst_rq->curr;

Which got me a bit stuck trying to use your old patch, since we can't access
'->curr' without locking dst_rq, but locking dst_rq is causing a lockup.

Thanks,
Sasha
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Sun, Nov 09, 2014 at 05:07:00PM +0300, Kirill Tkhai wrote:
> > I've complained about an unrelated issue in that part of the code
> > a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ
> > ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems
> > that both of us forgot to follow up on that and the fix never got
> > upstream.

Argh, sorry for that!

> The below is equal to the patch suggested by Peter.
>
> The only thing I doubt is the comparison of cpus instead of nids.
> Should task_numa_compare() be able to be called with src_nid == dst_nid,
> like may happen now?! Maybe better, we should change task_numa_migrate()
> and check for env.dst_nid != env.src_nid.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 826fdf3..c18129e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1380,6 +1380,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		/* Skip this CPU if the source task cannot migrate */
>  		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
>  			continue;
> +		if (cpu == env->src_cpu)
> +			continue;
> 
>  		env->dst_cpu = cpu;
>  		task_numa_compare(env, taskimp, groupimp);

Not quite the same, current can still get migrated to another cpu than
env->src_cpu, at which point we can still end up selecting ourselves.

How about the below instead? That is pretty much the old patch, but with
a nice comment.

---
Subject: sched, numa: Avoid selecting oneself as swap target
From: Peter Zijlstra <pet...@infradead.org>
Date: Mon Nov 10 10:54:35 CET 2014

Because the whole numa task selection stuff runs with preemption
enabled (it's long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.

Cc: Rik van Riel <r...@redhat.com>
Tested-by: Sasha Levin <sasha.le...@oracle.com>
Reported-by: Sasha Levin <sasha.le...@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
 	raw_spin_unlock_irq(&dst_rq->lock);
 
 	/*
+	 * Because we have preemption enabled we can get migrated around and
+	 * end try selecting ourselves (current == env->p) as a swap candidate.
+	 */
+	if (cur == env->p)
+		goto unlock;
+
+	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
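Why the `cur == env->p` check matters: a task swap ends up taking both runqueues' locks, and with preemption enabled the search can land on itself, i.e. on the same non-recursive lock twice. Below is a user-space sketch of that guard under a toy `struct rq` model; the names and locking scheme are illustrative assumptions, not the scheduler's actual code.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy runqueue: one non-recursive lock plus the task it is "running". */
struct rq {
    pthread_mutex_t lock;
    int curr;
};

/* Swap the "current" tasks of two runqueues. Without the src == dst
 * guard, swapping a runqueue with itself would lock the same
 * non-recursive mutex twice -- the deadlock Peter's patch avoids with
 * "if (cur == env->p) goto unlock;". */
static bool swap_curr(struct rq *src, struct rq *dst)
{
    if (src == dst)
        return false;   /* nothing to swap with ourselves */

    /* take both locks in a fixed (address) order to avoid ABBA deadlock */
    struct rq *first  = src < dst ? src : dst;
    struct rq *second = src < dst ? dst : src;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);

    int tmp = src->curr;
    src->curr = dst->curr;
    dst->curr = tmp;

    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
    return true;
}
```

The design choice mirrors the patch: the self-check happens before any lock is taken, so the degenerate case costs nothing and never touches the lock at all.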
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Sun, Nov 09, 2014 at 05:07:00PM +0300, Kirill Tkhai wrote: I've complained about an unrelated issue in that part of the code a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems that both of us forgot to follow up on that and the fix never got upstream. Argh, sorry for that! The bellow is equal to the patch suggested by Peter. The only thing, I'm doubt, is about the comparison of cpus instead of nids. Should task_numa_compare() be able to be called with src_nid == dst_nid like this may happens now?! Maybe better, we should change task_numa_migrate() and check for env.dst_nid != env.src.nid. diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 826fdf3..c18129e 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1380,6 +1380,8 @@ static void task_numa_find_cpu(struct task_numa_env *env, /* Skip this CPU if the source task cannot migrate */ if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env-p))) continue; + if (cpu == env-src_cpu) + continue; env-dst_cpu = cpu; task_numa_compare(env, taskimp, groupimp); Not quite the same, current can still get migrated to another cpu than env-src_cpu, at which point we can still end up selecting ourselves. How about the below instead, that is pretty much the old patch, but with a nice comment. --- Subject: sched, numa: Avoid selecting oneself as swap target From: Peter Zijlstra pet...@infradead.org Date: Mon Nov 10 10:54:35 CET 2014 Because the whole numa task selection stuff runs with preemption enabled (its long and expensive) we can end up migrating and selecting oneself as a swap target. This doesn't really work out well -- we end up trying to acquire the same lock twice for the swap migrate -- so avoid this. 
Cc: Rik van Riel <r...@redhat.com>
Tested-by: Sasha Levin <sasha.le...@oracle.com>
Reported-by: Sasha Levin <sasha.le...@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <pet...@infradead.org>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
 	raw_spin_unlock_irq(&dst_rq->lock);

 	/*
+	 * Because we have preemption enabled we can get migrated around and
+	 * end up trying to select ourselves (current == env->p) as a swap candidate.
+	 */
+	if (cur == env->p)
+		goto unlock;
+
+	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On 11/10/2014 05:03 AM, Peter Zijlstra wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
>  	raw_spin_unlock_irq(&dst_rq->lock);
>
>  	/*
> +	 * Because we have preemption enabled we can get migrated around and
> +	 * end up trying to select ourselves (current == env->p) as a swap candidate.
> +	 */
> +	if (cur == env->p)
> +		goto unlock;

This is too late though, because currently the lockup happens a couple
of lines above that, at:

	raw_spin_lock_irq(&dst_rq->lock);	<=== here
	cur = dst_rq->curr;

Which got me a bit stuck trying to use your old patch, since we can't
access '->curr' without locking dst_rq, but locking dst_rq is causing
a lockup.

Thanks,
Sasha
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Mon, Nov 10, 2014 at 10:48:28AM -0500, Sasha Levin wrote:
> On 11/10/2014 05:03 AM, Peter Zijlstra wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
> >  	raw_spin_unlock_irq(&dst_rq->lock);
> >
> >  	/*
> > +	 * Because we have preemption enabled we can get migrated around and
> > +	 * end up trying to select ourselves (current == env->p) as a swap candidate.
> > +	 */
> > +	if (cur == env->p)
> > +		goto unlock;
>
> This is too late though, because currently the lockup happens a couple
> of lines above that, at:
>
> 	raw_spin_lock_irq(&dst_rq->lock);	<=== here
> 	cur = dst_rq->curr;
>
> Which got me a bit stuck trying to use your old patch, since we can't
> access '->curr' without locking dst_rq, but locking dst_rq is causing
> a lockup.

Confused... how can we lock up there? We should not be holding _any_
lock there. That's a different problem than the originally reported
one.
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13

Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.

One of those again :/
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> > [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> > [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>
> One of those again :/

Hum. I missed that one. Hold on, but the magic here is fine and the
owner pointer is fine, why would just the owner cpu be wrong?

Thanks,
Sasha
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Mon, 10/11/2014 at 17:03 +0100, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> > [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> > [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>
> One of those again :/

We do not initialize task_struct::numa_preferred_nid for INIT_TASK.
Isn't that a problem?
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Mon, Nov 10, 2014 at 11:09:29AM -0500, Sasha Levin wrote:
> On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> > On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> > > [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> > > [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> >
> > Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> >
> > One of those again :/
>
> Hum. I missed that one. Hold on, but the magic here is fine and the
> owner pointer is fine, why would just the owner cpu be wrong?

I've no clue, but I've seen them before. Complete mystery, that.
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On Mon, 10/11/2014 at 19:10 +0300, Kirill Tkhai wrote:
> We do not initialize task_struct::numa_preferred_nid for INIT_TASK.
> Isn't that a problem?

I mean task_numa_find_cpu(). If there is garbage in
cpumask_of_node(env->dst_nid) and cpu is bigger than the mask, the
check cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)) may be true. So
we dereference a wrong rq in task_numa_compare(). It's not a rq at
all. The strange cpu may come from here. It's just an int number in
the wrong memory.

A hypothesis that the below may help:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..a2b4a8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 {
 	int cpu;

+	if (!node_online(env->dst_nid))
+		return;
+
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
> I mean task_numa_find_cpu(). If there is garbage in
> cpumask_of_node(env->dst_nid) and cpu is bigger than the mask, the
> check cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)) may be true.
> So we dereference a wrong rq in task_numa_compare(). It's not a rq
> at all. The strange cpu may come from here. It's just an int number
> in the wrong memory.

But the odds of the spinlock magic and owner pointer matching up are
slim to none in that case. The memory is also likely to be valid since
KASAN didn't complain about the access, so I don't believe it to be an
access to freed memory.

> A hypothesis that the below may help:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 826fdf3..a2b4a8a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  {
>  	int cpu;
>
> +	if (!node_online(env->dst_nid))
> +		return;

I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run
it for a bit.

Thanks,
Sasha
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
10.11.2014, 19:45, Sasha Levin <sasha.le...@oracle.com>:
> On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
> > I mean task_numa_find_cpu(). If there is garbage in
> > cpumask_of_node(env->dst_nid) and cpu is bigger than the mask, the
> > check cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)) may be true.
> > So we dereference a wrong rq in task_numa_compare(). It's not a rq
> > at all. The strange cpu may come from here. It's just an int number
> > in the wrong memory.
>
> But the odds of the spinlock magic and owner pointer matching up are
> slim to none in that case. The memory is also likely to be valid since
> KASAN didn't complain about the access, so I don't believe it to be an
> access to freed memory.
>
> > A hypothesis that the below may help:
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 826fdf3..a2b4a8a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> >  {
> >  	int cpu;
> >
> > +	if (!node_online(env->dst_nid))
> > +		return;
>
> I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run
> it for a bit.

I've looked one more time, and it looks like it's better to check for
BUG_ON(env->dst_nid >= MAX_NUMNODES). node_online() may not work for
insane nids.

Anyway, even if it's not connected, we need to initialize
numa_preferred_nid of init_task, because it's inherited by
kernel_init() (and /sbin/init too). I'll send the patch tomorrow.

Kirill
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
Hi,

On Fri, 07/11/2014 at 22:48 -0500, Sasha Levin wrote:
> On 10/22/2014 03:17 AM, Kirill Tkhai wrote:
> > Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> > If curr task is exiting this may be a reason of use-after-free:
> [...]
>
> I've complained about an unrelated issue in that part of the code
> a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ
> ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems
> that both of us forgot to follow up on that and the fix never got
> upstream.
>
> Ever since this patch made it upstream, Peter's patch which I was
> carrying in my tree stopped applying and I've started seeing:
>
> [  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> [  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> [  829.539212] CPU: 10 PID: 11067 Comm: trinity-c594 Not tainted 3.18.0-rc3-next-20141106-sasha-00054-g09b7ccf-dirty #1448
> [  829.539226]  880053acb000 88032b71f828
> [  829.539235]  a009fb5a 0057 880631dd6b80 88032b71f868
> [  829.539243]  963f0c57 880053acbd80 880053acbdb0 88032b71f858
> [  829.539246] Call Trace:
> [  829.539265] dump_stack (lib/dump_stack.c:52)
> [  829.539277] spin_dump (kernel/locking/spinlock_debug.c:68 (discriminator 8))
> [  829.539282] spin_bug (kernel/locking/spinlock_debug.c:76)
> [  829.539288] do_raw_spin_lock (kernel/locking/spinlock_debug.c:84 kernel/locking/spinlock_debug.c:135)
> [  829.539304] ? __schedule (kernel/sched/core.c:2803)
> [  829.539313] _raw_spin_lock_irq (include/linux/spinlock_api_smp.h:129 kernel/locking/spinlock.c:167)
> [  829.539321] ? task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
> [  829.539330] ? rcu_is_watching (./arch/x86/include/asm/preempt.h:95 kernel/rcu/tree.c:827)
> [  829.539336] task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
> [  829.539342] ? task_numa_find_cpu (kernel/sched/fair.c:1253 kernel/sched/fair.c:1385)
> [  829.539352] ? preempt_count_sub (kernel/sched/core.c:2644)
> [  829.539358] task_numa_migrate (kernel/sched/fair.c:1452)
> [  829.539364] ? task_numa_migrate (kernel/sched/fair.c:1391)
> [  829.539376] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:87 arch/x86/kernel/kvmclock.c:85)
> [  829.539386] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
> [  829.539392] ? sched_clock_local (kernel/sched/clock.c:202)
> [  829.539399] numa_migrate_preferred (kernel/sched/fair.c:1539)
> [  829.539404] ? sched_clock_local (kernel/sched/clock.c:202)
> [  829.539411] task_numa_fault (kernel/sched/fair.c:2073)
> [  829.539417] ? sched_clock_cpu (kernel/sched/clock.c:311)
> [  829.539429] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [  829.539438] ? get_lock_stats (kernel/locking/lockdep.c:249)
> [  829.539446] ? get_parent_ip (kernel/sched/core.c:2588)
> [  829.539461] handle_mm_fault (mm/memory.c:3187 mm/memory.c:3233 mm/memory.c:3346 mm/memory.c:3375)
> [  829.539466] ? find_vma (mm/mmap.c:2048)
> [  829.539477] __do_page_fault (arch/x86/mm/fault.c:1246)
> [  829.539485] ? context_tracking_user_exit (kernel/context_tracking.c:144)
> [  829.539491] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [  829.539498] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2640 (discriminator 8))
> [  829.539505] trace_do_page_fault (arch/x86/mm/fault.c:1329 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1330)
> [  829.539510] do_async_page_fault (arch/x86/kernel/kvm.c:280)
> [  829.539516] async_page_fault (arch/x86/kernel/entry_64.S:1301)

The below is equal to the patch suggested by Peter. The only thing I
doubt is the comparison of cpus instead of nids. Should
task_numa_compare() be able to be called with src_nid == dst_nid like
may happen now?! Maybe better, we should change task_numa_migrate()
and check for env.dst_nid != env.src_nid.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..c18129e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1380,6 +1380,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
 			continue;
+		if (cpu == env->src_cpu)
+			continue;

 		env->dst_cpu = cpu;
 		task_numa_compare(env, taskimp, groupimp);
Re: [PATCH v4] sched/numa: fix unsafe get_task_struct() in task_numa_assign()
On 10/22/2014 03:17 AM, Kirill Tkhai wrote:
> Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> If curr task is exiting this may be a reason of use-after-free:
[...]

I've complained about an unrelated issue in that part of the code
a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ
ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems
that both of us forgot to follow up on that and the fix never got
upstream.

Ever since this patch made it upstream, Peter's patch which I was
carrying in my tree stopped applying and I've started seeing:

[  829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
[  829.539203]  lock: 0x880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
[  829.539212] CPU: 10 PID: 11067 Comm: trinity-c594 Not tainted 3.18.0-rc3-next-20141106-sasha-00054-g09b7ccf-dirty #1448
[  829.539226]  880053acb000 88032b71f828
[  829.539235]  a009fb5a 0057 880631dd6b80 88032b71f868
[  829.539243]  963f0c57 880053acbd80 880053acbdb0 88032b71f858
[  829.539246] Call Trace:
[  829.539265] dump_stack (lib/dump_stack.c:52)
[  829.539277] spin_dump (kernel/locking/spinlock_debug.c:68 (discriminator 8))
[  829.539282] spin_bug (kernel/locking/spinlock_debug.c:76)
[  829.539288] do_raw_spin_lock (kernel/locking/spinlock_debug.c:84 kernel/locking/spinlock_debug.c:135)
[  829.539304] ? __schedule (kernel/sched/core.c:2803)
[  829.539313] _raw_spin_lock_irq (include/linux/spinlock_api_smp.h:129 kernel/locking/spinlock.c:167)
[  829.539321] ? task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
[  829.539330] ? rcu_is_watching (./arch/x86/include/asm/preempt.h:95 kernel/rcu/tree.c:827)
[  829.539336] task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
[  829.539342] ? task_numa_find_cpu (kernel/sched/fair.c:1253 kernel/sched/fair.c:1385)
[  829.539352] ? preempt_count_sub (kernel/sched/core.c:2644)
[  829.539358] task_numa_migrate (kernel/sched/fair.c:1452)
[  829.539364] ? task_numa_migrate (kernel/sched/fair.c:1391)
[  829.539376] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:87 arch/x86/kernel/kvmclock.c:85)
[  829.539386] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
[  829.539392] ? sched_clock_local (kernel/sched/clock.c:202)
[  829.539399] numa_migrate_preferred (kernel/sched/fair.c:1539)
[  829.539404] ? sched_clock_local (kernel/sched/clock.c:202)
[  829.539411] task_numa_fault (kernel/sched/fair.c:2073)
[  829.539417] ? sched_clock_cpu (kernel/sched/clock.c:311)
[  829.539429] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[  829.539438] ? get_lock_stats (kernel/locking/lockdep.c:249)
[  829.539446] ? get_parent_ip (kernel/sched/core.c:2588)
[  829.539461] handle_mm_fault (mm/memory.c:3187 mm/memory.c:3233 mm/memory.c:3346 mm/memory.c:3375)
[  829.539466] ? find_vma (mm/mmap.c:2048)
[  829.539477] __do_page_fault (arch/x86/mm/fault.c:1246)
[  829.539485] ? context_tracking_user_exit (kernel/context_tracking.c:144)
[  829.539491] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[  829.539498] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2640 (discriminator 8))
[  829.539505] trace_do_page_fault (arch/x86/mm/fault.c:1329 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1330)
[  829.539510] do_async_page_fault (arch/x86/kernel/kvm.c:280)
[  829.539516] async_page_fault (arch/x86/kernel/entry_64.S:1301)

Thanks,
Sasha