Re: run_timer_softirq gpf. tracing?
On Tue, 21 Mar 2017, Dave Jones wrote: > I just hit this while fuzzing.. > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 > task: 88017f0ed440 task.stack: c9094000 > RIP: 0010:run_timer_softirq+0x15f/0x700 > RSP: 0018:880507c03ec8 EFLAGS: 00010086 > RAX: dead0200 RBX: 880507dd0d00 RCX: 0002 > RDX: 880507c03ed0 RSI: RDI: 8204b3a0 > RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8 > R10: 880507dd0db0 R11: R12: 8215cc38 > R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8 > FS: () GS:880507c0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 0004 CR3: 05011000 CR4: 001406e0 > Call Trace: > > ? clockevents_program_event+0x47/0x120 > __do_softirq+0xbf/0x5b1 > irq_exit+0xb5/0xc0 > smp_apic_timer_interrupt+0x3d/0x50 > apic_timer_interrupt+0x97/0xa0 > RIP: 0010:cpuidle_enter_state+0x12e/0x400 > RSP: 0018:c9097e40 EFLAGS: 0202 > [CONT START] ORIG_RAX: ff10 > RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001 > RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440 > RBP: c9097e80 R08: R09: 0008 > R10: R11: R12: 0005 > R13: 820b9338 R14: 0005 R15: 820b9320 > > cpuidle_enter+0x17/0x20 > call_cpuidle+0x23/0x40 > do_idle+0xfb/0x200 > cpu_startup_entry+0x71/0x80 > start_secondary+0x16a/0x210 > start_cpu+0x14/0x14 > Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f 82 > 0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 > 47 2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 > > > RAX looks like list poison, and CR2 = 4, which is likely the ->next of a list, > with a NULL pointer. Certainly not on 64 bit. that would be 8. And CR2 is irrelevant here because that's a #GP not a #PF. > Here's the disassembly of timer.o > https://codemonkey.org.uk/junk/timer.o.txt > > Sure enough, run_timer_softirq+0x15f (1ccf) is.. > > > entry->next = LIST_POISON2; > 1cc8: 48 b8 00 02 00 00 00movabs $0xdead0200,%rax The code above says: <48> 89 50 08 which is mov %rdx, 0x8(%rax). and that is: next->pprev = pprev inside of detach_timer(). So this is actually inside __hlist_del(struct hlist_node *n) { struct hlist_node *next = n->next; struct hlist_node *pprev = n->pprev; WRITE_ONCE(*pprev, next); if (next) next->pprev = pprev; } The timer which expires has timer->entry.next == POISON2 ! This has nothing to do with tracing, it's a classic list corruption. The bad news is that there is no trace of the culprit because that happens when some other timer expires after some random amount of time. If that is reproducible, then please enable debugobjects. That should pinpoint the culprit. Thanks, tglx
Re: run_timer_softirq gpf. tracing?
On Tue, 21 Mar 2017, Dave Jones wrote: > I just hit this while fuzzing.. > > general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 > task: 88017f0ed440 task.stack: c9094000 > RIP: 0010:run_timer_softirq+0x15f/0x700 > RSP: 0018:880507c03ec8 EFLAGS: 00010086 > RAX: dead0200 RBX: 880507dd0d00 RCX: 0002 > RDX: 880507c03ed0 RSI: RDI: 8204b3a0 > RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8 > R10: 880507dd0db0 R11: R12: 8215cc38 > R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8 > FS: () GS:880507c0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 0004 CR3: 05011000 CR4: 001406e0 > Call Trace: > > ? clockevents_program_event+0x47/0x120 > __do_softirq+0xbf/0x5b1 > irq_exit+0xb5/0xc0 > smp_apic_timer_interrupt+0x3d/0x50 > apic_timer_interrupt+0x97/0xa0 > RIP: 0010:cpuidle_enter_state+0x12e/0x400 > RSP: 0018:c9097e40 EFLAGS: 0202 > [CONT START] ORIG_RAX: ff10 > RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001 > RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440 > RBP: c9097e80 R08: R09: 0008 > R10: R11: R12: 0005 > R13: 820b9338 R14: 0005 R15: 820b9320 > > cpuidle_enter+0x17/0x20 > call_cpuidle+0x23/0x40 > do_idle+0xfb/0x200 > cpu_startup_entry+0x71/0x80 > start_secondary+0x16a/0x210 > start_cpu+0x14/0x14 > Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f 82 > 0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 > 47 2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 > > > RAX looks like list poison, and CR2 = 4, which is likely the ->next of a list, > with a NULL pointer. Certainly not on 64 bit. that would be 8. And CR2 is irrelevant here because that's a #GP not a #PF. > Here's the disassembly of timer.o > https://codemonkey.org.uk/junk/timer.o.txt > > Sure enough, run_timer_softirq+0x15f (1ccf) is.. > > > entry->next = LIST_POISON2; > 1cc8: 48 b8 00 02 00 00 00movabs $0xdead0200,%rax The code above says: <48> 89 50 08 which is mov %rdx, 0x8(%rax). and that is: next->pprev = pprev inside of detach_timer(). So this is actually inside __hlist_del(struct hlist_node *n) { struct hlist_node *next = n->next; struct hlist_node *pprev = n->pprev; WRITE_ONCE(*pprev, next); if (next) next->pprev = pprev; } The timer which expires has timer->entry.next == POISON2 ! This has nothing to do with tracing, it's a classic list corruption. The bad news is that there is no trace of the culprit because that happens when some other timer expires after some random amount of time. If that is reproducible, then please enable debugobjects. That should pinpoint the culprit. Thanks, tglx
Re: run_timer_softirq gpf. tracing?
On Tue, Mar 21, 2017 at 08:25:39PM +0100, Thomas Gleixner wrote: > > RAX looks like list poison, and CR2 = 4, which is likely the ->next of a > > list, > > with a NULL pointer. > > Certainly not on 64 bit. that would be 8. And CR2 is irrelevant here > because that's a #GP not a #PF. doh! > The timer which expires has timer->entry.next == POISON2 ! > > This has nothing to do with tracing, it's a classic list corruption. The > bad news is that there is no trace of the culprit because that happens when > some other timer expires after some random amount of time. ah! Thanks for putting me back on the right path. > If that is reproducible, then please enable debugobjects. That should > pinpoint the culprit. hit it twice today so far, so hopefully it'll reproduce. thanks. Dave
Re: run_timer_softirq gpf. tracing?
On Tue, Mar 21, 2017 at 08:25:39PM +0100, Thomas Gleixner wrote: > > RAX looks like list poison, and CR2 = 4, which is likely the ->next of a > > list, > > with a NULL pointer. > > Certainly not on 64 bit. that would be 8. And CR2 is irrelevant here > because that's a #GP not a #PF. doh! > The timer which expires has timer->entry.next == POISON2 ! > > This has nothing to do with tracing, it's a classic list corruption. The > bad news is that there is no trace of the culprit because that happens when > some other timer expires after some random amount of time. ah! Thanks for putting me back on the right path. > If that is reproducible, then please enable debugobjects. That should > pinpoint the culprit. hit it twice today so far, so hopefully it'll reproduce. thanks. Dave
run_timer_softirq gpf. tracing?
I just hit this while fuzzing.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 task: 88017f0ed440 task.stack: c9094000 RIP: 0010:run_timer_softirq+0x15f/0x700 RSP: 0018:880507c03ec8 EFLAGS: 00010086 RAX: dead0200 RBX: 880507dd0d00 RCX: 0002 RDX: 880507c03ed0 RSI: RDI: 8204b3a0 RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8 R10: 880507dd0db0 R11: R12: 8215cc38 R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8 FS: () GS:880507c0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0004 CR3: 05011000 CR4: 001406e0 Call Trace: ? clockevents_program_event+0x47/0x120 __do_softirq+0xbf/0x5b1 irq_exit+0xb5/0xc0 smp_apic_timer_interrupt+0x3d/0x50 apic_timer_interrupt+0x97/0xa0 RIP: 0010:cpuidle_enter_state+0x12e/0x400 RSP: 0018:c9097e40 EFLAGS: 0202 [CONT START] ORIG_RAX: ff10 RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001 RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440 RBP: c9097e80 R08: R09: 0008 R10: R11: R12: 0005 R13: 820b9338 R14: 0005 R15: 820b9320 cpuidle_enter+0x17/0x20 call_cpuidle+0x23/0x40 do_idle+0xfb/0x200 cpu_startup_entry+0x71/0x80 start_secondary+0x16a/0x210 start_cpu+0x14/0x14 Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f 82 0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 47 2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 RAX looks like list poison, and CR2 = 4, which is likely the ->next of a list, with a NULL pointer. Here's the disassembly of timer.o https://codemonkey.org.uk/junk/timer.o.txt Sure enough, run_timer_softirq+0x15f (1ccf) is.. entry->next = LIST_POISON2; 1cc8: 48 b8 00 02 00 00 00movabs $0xdead0200,%rax 1ccf: 00 ad de This looks to be something to do with the tracing definition of timer_cancel. I don't fully grok how the tracing macros work, is there something obvious here before I dig deeper ? Right before that list op, we do this.. if (timer->flags & TIMER_IRQSAFE) { Which looks like a line from expire_timers. I'm not following how we get there from that tracing definition. No tracing was active when this happened. Dave
run_timer_softirq gpf. tracing?
I just hit this while fuzzing.. general protection fault: [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 task: 88017f0ed440 task.stack: c9094000 RIP: 0010:run_timer_softirq+0x15f/0x700 RSP: 0018:880507c03ec8 EFLAGS: 00010086 RAX: dead0200 RBX: 880507dd0d00 RCX: 0002 RDX: 880507c03ed0 RSI: RDI: 8204b3a0 RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8 R10: 880507dd0db0 R11: R12: 8215cc38 R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8 FS: () GS:880507c0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0004 CR3: 05011000 CR4: 001406e0 Call Trace: ? clockevents_program_event+0x47/0x120 __do_softirq+0xbf/0x5b1 irq_exit+0xb5/0xc0 smp_apic_timer_interrupt+0x3d/0x50 apic_timer_interrupt+0x97/0xa0 RIP: 0010:cpuidle_enter_state+0x12e/0x400 RSP: 0018:c9097e40 EFLAGS: 0202 [CONT START] ORIG_RAX: ff10 RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001 RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440 RBP: c9097e80 R08: R09: 0008 R10: R11: R12: 0005 R13: 820b9338 R14: 0005 R15: 820b9320 cpuidle_enter+0x17/0x20 call_cpuidle+0x23/0x40 do_idle+0xfb/0x200 cpu_startup_entry+0x71/0x80 start_secondary+0x16a/0x210 start_cpu+0x14/0x14 Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f 82 0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 47 2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 RAX looks like list poison, and CR2 = 4, which is likely the ->next of a list, with a NULL pointer. Here's the disassembly of timer.o https://codemonkey.org.uk/junk/timer.o.txt Sure enough, run_timer_softirq+0x15f (1ccf) is.. entry->next = LIST_POISON2; 1cc8: 48 b8 00 02 00 00 00movabs $0xdead0200,%rax 1ccf: 00 ad de This looks to be something to do with the tracing definition of timer_cancel. I don't fully grok how the tracing macros work, is there something obvious here before I dig deeper ? Right before that list op, we do this.. if (timer->flags & TIMER_IRQSAFE) { Which looks like a line from expire_timers. I'm not following how we get there from that tracing definition. No tracing was active when this happened. Dave