Re: sched: divide error in sg_capacity_factor

2015-03-10 Thread Sasha Levin
On 03/10/2015 12:29 AM, Ingo Molnar wrote:
> 
> * Sasha Levin  wrote:
> 
>> Hi all,
>>
>> While fuzzing with trinity inside the latest -next kernel using trinity I've 
>> stumbled on:
>>
>> [  936.784266] divide error:  [#1] PREEMPT SMP KASAN
>> [  936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 
>> kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)
> 
> Hm, these line numbers don't seem to match up very well with my 
> version of linux-next:
> 
>   28855005be1d Add linux-next specific files for 20150306
> 
> and the Git version info included in the oops seems useless:
> 
>   4.0.0-rc1-sasha-00044-ge21109a
> 
>   $ git log e21109a
>   fatal: ambiguous argument 'e21109a': unknown revision or path not in the 
> working tree.
> 
> I think the kernel's SHA1 should be made at least 12 char wide, 
> regardless of the user's gitconfig::core.abbrev settings.
> 
> Also, latest linux-next is -rc2 based, while your version string says 
> -rc1.
> 
>> [  936.829403] load_balance (kernel/sched/fair.c:6857)
> 
> this does not match up either.
> 
>> [  936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 
>> kernel/sched/fair.c:7659)
> 
> The line numbers are not even close to anything related: 
> run_rebalance_domains() starts at line 7666 and ends at 7680.

Right, this is my fuck up. It seems that I was fuzzing 4.0-rc1 rather than
-next as I thought I was. I forgot to go back to -next after I tested a few
things on Linus's tree.

So the line numbers should match correctly with Linus's tree as you've already
guessed below.

> Also, why are the offsets into the function missing from the output? 
> Those would allow the rough determination of the crash site, even if 
> debuginfo is crap.

I found that offsets are useless here because of the really odd things the
compiler does based on my config. There's so many things that got inlined
in this case that I think offsets wouldn't mean anything to you here.

For example, in this case the division by 0 happened on 
load_balance+0x88a/0x2399.

> I also checked Linus's latest, and they do seem to match up better:
> 
>   affb8172de39 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
> 
> and the line number gives:
> 
> capacity_factor = min_t(unsigned,
> capacity_factor, DIV_ROUND_CLOSEST(capacity, 
> SCHED_CAPACITY_SCALE));
> 
> but that's a division with a constant? Should not trap.
> 
> So I rebuild a kernel with debug info, pattern matched the disassembly 
> you provided, and that gave me this division:
> 
> (gdb) list *0x8107d958
> 0x8107d958 is in find_busiest_group (kernel/sched/fair.c:6162).
> 6157capacity = group->sgc->capacity;
> 6158capacity_orig = group->sgc->capacity_orig;
> 6159cpus = group->group_weight;
> 6160
> 6161/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 
> 2 */
> 6162smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, 
> capacity_orig);
> 6163capacity_factor = cpus / smt; /* cores */
> 6164
> 6165capacity_factor = min_t(unsigned,
> 6166capacity_factor, DIV_ROUND_CLOSEST(capacity, 
> SCHED_CAPACITY_SCALE));

This is the division I was seeing as well.

> So this too seems not very plausible: 'capacity_orig' comes straight 
> from group->sgc->capacity_orig, which is:
> 
>  - boot time initialized
> 
>  - sometimes recalculated during CPU hot-plug: not sure how much of 
>that your tests are doing?

I'm not forcing them, but they do happen pretty often.

>  - but otherwise it's fairly constant and should have crashed your 
>system early on if it was set up wrong
> 
> unless I missed something that is.
> 
>> [  936.829403] __do_softirq (kernel/softirq.c:273 
>> include/linux/jump_label.h:114 include/trace/events/irq.h:126 
>> kernel/softirq.c:274)
>> [  936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
>> [  936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
>> [  936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
>> [  936.829403]  
>> [  936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
>> [  936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
>> [  936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
>> [  936.829403] __set_page_owner (mm/page_owner.c:72)
>> [  936.829403] get_page_from_freelist (include/linux/page_owner.h:26 
>> mm/page_alloc.c:2176)
>> [  936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
>> [  936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
>> [  936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 
>> mm/memory.c:3269 mm/memory.c:3298)
>> [  936.829403] __do_page_fault (arch/x86/mm/fault.c:1235)
>> [  936.829403] trace_do_page_fault (arch/x86/mm/fault.c:1329)
>> [  936.829403] do_async_page_fault (arch/x86/kernel/kvm.c:280)
>> [  936.829403] async_page_fault (arch/x86/kernel/entry_64.S:1295)
> 
> So debug info weirdnesses aside, 

Re: sched: divide error in sg_capacity_factor

2015-03-10 Thread Sasha Levin
On 03/10/2015 12:29 AM, Ingo Molnar wrote:
 
 * Sasha Levin sasha.le...@oracle.com wrote:
 
 Hi all,

 While fuzzing with trinity inside the latest -next kernel using trinity I've 
 stumbled on:

 [  936.784266] divide error:  [#1] PREEMPT SMP KASAN
 [  936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 
 kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)
 
 Hm, these line numbers don't seem to match up very well with my 
 version of linux-next:
 
   28855005be1d Add linux-next specific files for 20150306
 
 and the Git version info included in the oops seems useless:
 
   4.0.0-rc1-sasha-00044-ge21109a
 
   $ git log e21109a
   fatal: ambiguous argument 'e21109a': unknown revision or path not in the 
 working tree.
 
 I think the kernel's SHA1 should be made at least 12 char wide, 
 regardless of the user's gitconfig::core.abbrev settings.
 
 Also, latest linux-next is -rc2 based, while your version string says 
 -rc1.
 
 [  936.829403] load_balance (kernel/sched/fair.c:6857)
 
 this does not match up either.
 
 [  936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 
 kernel/sched/fair.c:7659)
 
 The line numbers are not even close to anything related: 
 run_rebalance_domains() starts at line 7666 and ends at 7680.

Right, this is my fuck up. It seems that I was fuzzing 4.0-rc1 rather than
-next as I thought I was. I forgot to go back to -next after I tested a few
things on Linus's tree.

So the line numbers should match correctly with Linus's tree as you've already
guessed below.

 Also, why are the offsets into the function missing from the output? 
 Those would allow the rough determination of the crash site, even if 
 debuginfo is crap.

I found that offsets are useless here because of the really odd things the
compiler does based on my config. There's so many things that got inlined
in this case that I think offsets wouldn't mean anything to you here.

For example, in this case the division by 0 happened on 
load_balance+0x88a/0x2399.

 I also checked Linus's latest, and they do seem to match up better:
 
   affb8172de39 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm
 
 and the line number gives:
 
 capacity_factor = min_t(unsigned,
 capacity_factor, DIV_ROUND_CLOSEST(capacity, 
 SCHED_CAPACITY_SCALE));
 
 but that's a division with a constant? Should not trap.
 
 So I rebuild a kernel with debug info, pattern matched the disassembly 
 you provided, and that gave me this division:
 
 (gdb) list *0x8107d958
 0x8107d958 is in find_busiest_group (kernel/sched/fair.c:6162).
 6157capacity = group-sgc-capacity;
 6158capacity_orig = group-sgc-capacity_orig;
 6159cpus = group-group_weight;
 6160
 6161/* smt := ceil(cpus / capacity), assumes: 1  smt_capacity  
 2 */
 6162smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, 
 capacity_orig);
 6163capacity_factor = cpus / smt; /* cores */
 6164
 6165capacity_factor = min_t(unsigned,
 6166capacity_factor, DIV_ROUND_CLOSEST(capacity, 
 SCHED_CAPACITY_SCALE));

This is the division I was seeing as well.

 So this too seems not very plausible: 'capacity_orig' comes straight 
 from group-sgc-capacity_orig, which is:
 
  - boot time initialized
 
  - sometimes recalculated during CPU hot-plug: not sure how much of 
that your tests are doing?

I'm not forcing them, but they do happen pretty often.

  - but otherwise it's fairly constant and should have crashed your 
system early on if it was set up wrong
 
 unless I missed something that is.
 
 [  936.829403] __do_softirq (kernel/softirq.c:273 
 include/linux/jump_label.h:114 include/trace/events/irq.h:126 
 kernel/softirq.c:274)
 [  936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
 [  936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
 [  936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
 [  936.829403]  EOI
 [  936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
 [  936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
 [  936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
 [  936.829403] __set_page_owner (mm/page_owner.c:72)
 [  936.829403] get_page_from_freelist (include/linux/page_owner.h:26 
 mm/page_alloc.c:2176)
 [  936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
 [  936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
 [  936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 
 mm/memory.c:3269 mm/memory.c:3298)
 [  936.829403] __do_page_fault (arch/x86/mm/fault.c:1235)
 [  936.829403] trace_do_page_fault (arch/x86/mm/fault.c:1329)
 [  936.829403] do_async_page_fault (arch/x86/kernel/kvm.c:280)
 [  936.829403] async_page_fault (arch/x86/kernel/entry_64.S:1295)
 
 So debug info weirdnesses aside, other divisions in 
 find_busiest_group():
 
 sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
 

Re: sched: divide error in sg_capacity_factor

2015-03-09 Thread Ingo Molnar

* Sasha Levin  wrote:

> Hi all,
> 
> While fuzzing with trinity inside the latest -next kernel using trinity I've 
> stumbled on:
> 
> [  936.784266] divide error:  [#1] PREEMPT SMP KASAN
> [  936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 
> kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)

Hm, these line numbers don't seem to match up very well with my 
version of linux-next:

  28855005be1d Add linux-next specific files for 20150306

and the Git version info included in the oops seems useless:

  4.0.0-rc1-sasha-00044-ge21109a

  $ git log e21109a
  fatal: ambiguous argument 'e21109a': unknown revision or path not in the 
working tree.

I think the kernel's SHA1 should be made at least 12 char wide, 
regardless of the user's gitconfig::core.abbrev settings.

Also, latest linux-next is -rc2 based, while your version string says 
-rc1.

> [  936.829403] load_balance (kernel/sched/fair.c:6857)

this does not match up either.

> [  936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 
> kernel/sched/fair.c:7659)

The line numbers are not even close to anything related: 
run_rebalance_domains() starts at line 7666 and ends at 7680.

Also, why are the offsets into the function missing from the output? 
Those would allow the rough determination of the crash site, even if 
debuginfo is crap.

I also checked Linus's latest, and they do seem to match up better:

  affb8172de39 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm

and the line number gives:

capacity_factor = min_t(unsigned,
capacity_factor, DIV_ROUND_CLOSEST(capacity, 
SCHED_CAPACITY_SCALE));

but that's a division with a constant? Should not trap.

So I rebuild a kernel with debug info, pattern matched the disassembly 
you provided, and that gave me this division:

(gdb) list *0x8107d958
0x8107d958 is in find_busiest_group (kernel/sched/fair.c:6162).
6157capacity = group->sgc->capacity;
6158capacity_orig = group->sgc->capacity_orig;
6159cpus = group->group_weight;
6160
6161/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 
*/
6162smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
6163capacity_factor = cpus / smt; /* cores */
6164
6165capacity_factor = min_t(unsigned,
6166capacity_factor, DIV_ROUND_CLOSEST(capacity, 
SCHED_CAPACITY_SCALE));

So this too seems not very plausible: 'capacity_orig' comes straight 
from group->sgc->capacity_orig, which is:

 - boot time initialized

 - sometimes recalculated during CPU hot-plug: not sure how much of 
   that your tests are doing?

 - but otherwise it's fairly constant and should have crashed your 
   system early on if it was set up wrong

unless I missed something that is.

> [  936.829403] __do_softirq (kernel/softirq.c:273 
> include/linux/jump_label.h:114 include/trace/events/irq.h:126 
> kernel/softirq.c:274)
> [  936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
> [  936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
> [  936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
> [  936.829403]  
> [  936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
> [  936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
> [  936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
> [  936.829403] __set_page_owner (mm/page_owner.c:72)
> [  936.829403] get_page_from_freelist (include/linux/page_owner.h:26 
> mm/page_alloc.c:2176)
> [  936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
> [  936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
> [  936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 
> mm/memory.c:3269 mm/memory.c:3298)
> [  936.829403] __do_page_fault (arch/x86/mm/fault.c:1235)
> [  936.829403] trace_do_page_fault (arch/x86/mm/fault.c:1329)
> [  936.829403] do_async_page_fault (arch/x86/kernel/kvm.c:280)
> [  936.829403] async_page_fault (arch/x86/kernel/entry_64.S:1295)

So debug info weirdnesses aside, other divisions in 
find_busiest_group():

sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
/ sds.total_capacity;

total_capacity ought to be zero only on a totally borked machine 
(unlikely to boot), or on memory corruption.

if calculate_imbalance() got inlined, then:

load_above_capacity /= busiest->group_capacity;

that too ought to only get corrupted in the most serious cases, we 
don't recalculate it runtime.

So I'm baffled. Some tentative handwaving, pointing away from the 
scheduler:

 - Your stack trace is 'weird' not just due to debug info: an async 
   page fault doing allocations, doing a stack trace, interruted by a 
   timer irq, doing scheduler rebalancing...

 - The (spectacularly misnamed [*] ) CONFIG_PAGE_OWNER=y page lifetime
   tracing facility got enabled explicitly via the page_owner=on boot 

sched: divide error in sg_capacity_factor

2015-03-09 Thread Sasha Levin
Hi all,

While fuzzing with trinity inside the latest -next kernel using trinity I've 
stumbled on:

[  936.784266] divide error:  [#1] PREEMPT SMP KASAN
[  936.789198] Dumping ftrace buffer:
[  936.793957](ftrace buffer empty)
[  936.793957] Modules linked in:
[  936.793957] CPU: 52 PID: 22110 Comm: trinity-c52 Tainted: GW   
4.0.0-rc1-sasha-00044-ge21109a #2039
[  936.793957] task: 8807ff293000 ti: 880f81fe8000 task.ti: 
880f81fe8000
[  936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 
kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)
[  936.829403] RSP: :8810c28079a8  EFLAGS: 00010206
[  936.829403] RAX: 03ff RBX: 004e RCX: 2000
[  936.829403] RDX:  RSI:  RDI: 
[  936.829403] RBP: 8810c2807be8 R08: 0001 R09: 0001
[  936.829403] R10: 0001 R11: 0008 R12: dc00
[  936.829403] R13: 0001 R14: 8810c2807b40 R15: 8810c2807ce8
[  936.829403] FS:  7f89c95ff700() GS:8810c280() 
knlGS:
[  936.829403] CS:  0010 DS:  ES:  CR0: 8005003b
[  936.829403] CR2: 03503ff8 CR3: 000f8237b000 CR4: 07a0
[  936.829403] DR0:  DR1:  DR2: 
[  936.829403] DR3:  DR6: 0ff0 DR7: 00050602
[  936.829403] Stack:
[  936.829403]  0082 0001 8810c28079d8 
8810c2817a88
[  936.829403]   110218500f4b c2817a88 
8810c2807d14
[  936.829403]  8810c2807b50 8810c2807cfc 8810c0740010 
000307418e2e
[  936.829403] Call Trace:
[  936.829403]  
[  936.829403] ? __enqueue_entity (kernel/sched/fair.c:501)
[  936.829403] ? update_group_capacity (kernel/sched/fair.c:6593)
[  936.829403] ? update_cfs_shares (kernel/sched/fair.c:2375)
[  936.829403] ? cpumask_next_and (lib/cpumask.c:40)
[  936.829403] load_balance (kernel/sched/fair.c:6857)
[  936.829403] ? _raw_spin_unlock_irqrestore 
(./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:163 
kernel/locking/spinlock.c:191)
[  936.829403] ? update_blocked_averages (kernel/sched/fair.c:5743)
[  936.829403] ? find_busiest_group (kernel/sched/fair.c:6820)
[  936.829403] ? run_rebalance_domains (kernel/sched/fair.c:7450 
kernel/sched/fair.c:7659)
[  936.829403] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2566)
[  936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 
kernel/sched/fair.c:7659)
[  936.829403] ? run_rebalance_domains (kernel/sched/fair.c:7450 
kernel/sched/fair.c:7659)
[  936.829403] ? pick_next_task_fair (kernel/sched/fair.c:7654)
[  936.829403] ? irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
[  936.829403] __do_softirq (kernel/softirq.c:273 
include/linux/jump_label.h:114 include/trace/events/irq.h:126 
kernel/softirq.c:274)
[  936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
[  936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
[  936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
[  936.829403]  
[  936.829403] ? is_module_address (kernel/module.c:3835)
[  936.829403] ? __kernel_text_address (kernel/extable.c:104)
[  936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
[  936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
[  936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[  936.829403] __set_page_owner (mm/page_owner.c:72)
[  936.829403] ? __reset_page_owner (mm/page_owner.c:61)
[  936.829403] ? __inc_zone_state (mm/vmstat.c:271)
[  936.829403] get_page_from_freelist (include/linux/page_owner.h:26 
mm/page_alloc.c:2176)
[  936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
[  936.829403] ? alloc_pages_vma (mm/mempolicy.c:2007)
[  936.829403] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[  936.829403] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[  936.829403] ? __alloc_pages_direct_compact (mm/page_alloc.c:2797)
[  936.829403] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[  936.829403] ? arch_local_irq_restore (init/do_mounts.h:19)
[  936.829403] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2566)
[  936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
[  936.829403] ? handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 
mm/memory.c:3269 mm/memory.c:3298)
[  936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 
mm/memory.c:3269 mm/memory.c:3298)
[  936.829403] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[  936.829403] ? __pmd_alloc (mm/memory.c:3280)
[  936.829403] ? perf_event_context_sched_in (kernel/events/core.c:2755)
[  936.829403] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2566)
[  936.829403] ? __do_page_fault (arch/x86/mm/fault.c:1173)
[  936.829403] ? ___might_sleep (kernel/sched/core.c:7297 

sched: divide error in sg_capacity_factor

2015-03-09 Thread Sasha Levin
Hi all,

While fuzzing with trinity inside the latest -next kernel using trinity I've 
stumbled on:

[  936.784266] divide error:  [#1] PREEMPT SMP KASAN
[  936.789198] Dumping ftrace buffer:
[  936.793957](ftrace buffer empty)
[  936.793957] Modules linked in:
[  936.793957] CPU: 52 PID: 22110 Comm: trinity-c52 Tainted: GW   
4.0.0-rc1-sasha-00044-ge21109a #2039
[  936.793957] task: 8807ff293000 ti: 880f81fe8000 task.ti: 
880f81fe8000
[  936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 
kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)
[  936.829403] RSP: :8810c28079a8  EFLAGS: 00010206
[  936.829403] RAX: 03ff RBX: 004e RCX: 2000
[  936.829403] RDX:  RSI:  RDI: 
[  936.829403] RBP: 8810c2807be8 R08: 0001 R09: 0001
[  936.829403] R10: 0001 R11: 0008 R12: dc00
[  936.829403] R13: 0001 R14: 8810c2807b40 R15: 8810c2807ce8
[  936.829403] FS:  7f89c95ff700() GS:8810c280() 
knlGS:
[  936.829403] CS:  0010 DS:  ES:  CR0: 8005003b
[  936.829403] CR2: 03503ff8 CR3: 000f8237b000 CR4: 07a0
[  936.829403] DR0:  DR1:  DR2: 
[  936.829403] DR3:  DR6: 0ff0 DR7: 00050602
[  936.829403] Stack:
[  936.829403]  0082 0001 8810c28079d8 
8810c2817a88
[  936.829403]   110218500f4b c2817a88 
8810c2807d14
[  936.829403]  8810c2807b50 8810c2807cfc 8810c0740010 
000307418e2e
[  936.829403] Call Trace:
[  936.829403]  IRQ
[  936.829403] ? __enqueue_entity (kernel/sched/fair.c:501)
[  936.829403] ? update_group_capacity (kernel/sched/fair.c:6593)
[  936.829403] ? update_cfs_shares (kernel/sched/fair.c:2375)
[  936.829403] ? cpumask_next_and (lib/cpumask.c:40)
[  936.829403] load_balance (kernel/sched/fair.c:6857)
[  936.829403] ? _raw_spin_unlock_irqrestore 
(./arch/x86/include/asm/preempt.h:95 include/linux/spinlock_api_smp.h:163 
kernel/locking/spinlock.c:191)
[  936.829403] ? update_blocked_averages (kernel/sched/fair.c:5743)
[  936.829403] ? find_busiest_group (kernel/sched/fair.c:6820)
[  936.829403] ? run_rebalance_domains (kernel/sched/fair.c:7450 
kernel/sched/fair.c:7659)
[  936.829403] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2566)
[  936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 
kernel/sched/fair.c:7659)
[  936.829403] ? run_rebalance_domains (kernel/sched/fair.c:7450 
kernel/sched/fair.c:7659)
[  936.829403] ? pick_next_task_fair (kernel/sched/fair.c:7654)
[  936.829403] ? irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
[  936.829403] __do_softirq (kernel/softirq.c:273 
include/linux/jump_label.h:114 include/trace/events/irq.h:126 
kernel/softirq.c:274)
[  936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
[  936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
[  936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
[  936.829403]  EOI
[  936.829403] ? is_module_address (kernel/module.c:3835)
[  936.829403] ? __kernel_text_address (kernel/extable.c:104)
[  936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
[  936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
[  936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[  936.829403] __set_page_owner (mm/page_owner.c:72)
[  936.829403] ? __reset_page_owner (mm/page_owner.c:61)
[  936.829403] ? __inc_zone_state (mm/vmstat.c:271)
[  936.829403] get_page_from_freelist (include/linux/page_owner.h:26 
mm/page_alloc.c:2176)
[  936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
[  936.829403] ? alloc_pages_vma (mm/mempolicy.c:2007)
[  936.829403] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[  936.829403] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[  936.829403] ? __alloc_pages_direct_compact (mm/page_alloc.c:2797)
[  936.829403] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[  936.829403] ? arch_local_irq_restore (init/do_mounts.h:19)
[  936.829403] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2566)
[  936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
[  936.829403] ? handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 
mm/memory.c:3269 mm/memory.c:3298)
[  936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 
mm/memory.c:3269 mm/memory.c:3298)
[  936.829403] ? debug_check_no_locks_freed (kernel/locking/lockdep.c:3051)
[  936.829403] ? __pmd_alloc (mm/memory.c:3280)
[  936.829403] ? perf_event_context_sched_in (kernel/events/core.c:2755)
[  936.829403] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2566)
[  936.829403] ? __do_page_fault (arch/x86/mm/fault.c:1173)
[  936.829403] ? ___might_sleep 

Re: sched: divide error in sg_capacity_factor

2015-03-09 Thread Ingo Molnar

* Sasha Levin sasha.le...@oracle.com wrote:

 Hi all,
 
 While fuzzing with trinity inside the latest -next kernel using trinity I've 
 stumbled on:
 
 [  936.784266] divide error:  [#1] PREEMPT SMP KASAN
 [  936.793957] RIP: find_busiest_group (kernel/sched/fair.c:6152 
 kernel/sched/fair.c:6223 kernel/sched/fair.c:6341 kernel/sched/fair.c:6603)

Hm, these line numbers don't seem to match up very well with my 
version of linux-next:

  28855005be1d Add linux-next specific files for 20150306

and the Git version info included in the oops seems useless:

  4.0.0-rc1-sasha-00044-ge21109a

  $ git log e21109a
  fatal: ambiguous argument 'e21109a': unknown revision or path not in the 
working tree.

I think the kernel's SHA1 should be made at least 12 char wide, 
regardless of the user's gitconfig::core.abbrev settings.

Also, latest linux-next is -rc2 based, while your version string says 
-rc1.

 [  936.829403] load_balance (kernel/sched/fair.c:6857)

this does not match up either.

 [  936.829403] run_rebalance_domains (kernel/sched/fair.c:7494 
 kernel/sched/fair.c:7659)

The line numbers are not even close to anything related: 
run_rebalance_domains() starts at line 7666 and ends at 7680.

Also, why are the offsets into the function missing from the output? 
Those would allow the rough determination of the crash site, even if 
debuginfo is crap.

I also checked Linus's latest, and they do seem to match up better:

  affb8172de39 Merge git://git.kernel.org/pub/scm/virt/kvm/kvm

and the line number gives:

capacity_factor = min_t(unsigned,
capacity_factor, DIV_ROUND_CLOSEST(capacity, 
SCHED_CAPACITY_SCALE));

but that's a division with a constant? Should not trap.

So I rebuild a kernel with debug info, pattern matched the disassembly 
you provided, and that gave me this division:

(gdb) list *0x8107d958
0x8107d958 is in find_busiest_group (kernel/sched/fair.c:6162).
6157capacity = group-sgc-capacity;
6158capacity_orig = group-sgc-capacity_orig;
6159cpus = group-group_weight;
6160
6161/* smt := ceil(cpus / capacity), assumes: 1  smt_capacity  2 
*/
6162smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
6163capacity_factor = cpus / smt; /* cores */
6164
6165capacity_factor = min_t(unsigned,
6166capacity_factor, DIV_ROUND_CLOSEST(capacity, 
SCHED_CAPACITY_SCALE));

So this too seems not very plausible: 'capacity_orig' comes straight 
from group-sgc-capacity_orig, which is:

 - boot time initialized

 - sometimes recalculated during CPU hot-plug: not sure how much of 
   that your tests are doing?

 - but otherwise it's fairly constant and should have crashed your 
   system early on if it was set up wrong

unless I missed something that is.

 [  936.829403] __do_softirq (kernel/softirq.c:273 
 include/linux/jump_label.h:114 include/trace/events/irq.h:126 
 kernel/softirq.c:274)
 [  936.829403] irq_exit (kernel/softirq.c:350 kernel/softirq.c:391)
 [  936.829403] smp_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:918)
 [  936.829403] apic_timer_interrupt (arch/x86/kernel/entry_64.S:958)
 [  936.829403]  EOI
 [  936.829403] print_context_stack (arch/x86/kernel/dumpstack.c:105)
 [  936.829403] dump_trace (arch/x86/kernel/dumpstack_64.c:244)
 [  936.829403] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
 [  936.829403] __set_page_owner (mm/page_owner.c:72)
 [  936.829403] get_page_from_freelist (include/linux/page_owner.h:26 
 mm/page_alloc.c:2176)
 [  936.829403] __alloc_pages_nodemask (mm/page_alloc.c:2844)
 [  936.829403] alloc_pages_vma (mm/mempolicy.c:2007)
 [  936.829403] handle_mm_fault (mm/memory.c:2156 mm/memory.c:3164 
 mm/memory.c:3269 mm/memory.c:3298)
 [  936.829403] __do_page_fault (arch/x86/mm/fault.c:1235)
 [  936.829403] trace_do_page_fault (arch/x86/mm/fault.c:1329)
 [  936.829403] do_async_page_fault (arch/x86/kernel/kvm.c:280)
 [  936.829403] async_page_fault (arch/x86/kernel/entry_64.S:1295)

So debug info weirdnesses aside, other divisions in 
find_busiest_group():

sds.avg_load = (SCHED_CAPACITY_SCALE * sds.total_load)
/ sds.total_capacity;

total_capacity ought to be zero only on a totally borked machine 
(unlikely to boot), or on memory corruption.

if calculate_imbalance() got inlined, then:

load_above_capacity /= busiest-group_capacity;

that too ought to only get corrupted in the most serious cases, we 
don't recalculate it runtime.

So I'm baffled. Some tentative handwaving, pointing away from the 
scheduler:

 - Your stack trace is 'weird' not just due to debug info: an async 
   page fault doing allocations, doing a stack trace, interruted by a 
   timer irq, doing scheduler rebalancing...

 - The (spectacularly misnamed [*] ) CONFIG_PAGE_OWNER=y page lifetime
   tracing facility got enabled explicitly via the page_owner=on boot 
   parameter,