Re: run_timer_softirq gpf. tracing?

2017-03-21 Thread Thomas Gleixner
On Tue, 21 Mar 2017, Dave Jones wrote:
> I just hit this while fuzzing..
> 
> general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 
> task: 88017f0ed440 task.stack: c9094000
> RIP: 0010:run_timer_softirq+0x15f/0x700
> RSP: 0018:880507c03ec8 EFLAGS: 00010086
> RAX: dead0200 RBX: 880507dd0d00 RCX: 0002
> RDX: 880507c03ed0 RSI:  RDI: 8204b3a0
> RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8
> R10: 880507dd0db0 R11:  R12: 8215cc38
> R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8
> FS:  () GS:880507c0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0004 CR3: 05011000 CR4: 001406e0
> Call Trace:
>  
>  ? clockevents_program_event+0x47/0x120
>  __do_softirq+0xbf/0x5b1
>  irq_exit+0xb5/0xc0
>  smp_apic_timer_interrupt+0x3d/0x50
>  apic_timer_interrupt+0x97/0xa0
> RIP: 0010:cpuidle_enter_state+0x12e/0x400
> RSP: 0018:c9097e40 EFLAGS: 0202
> [CONT START]  ORIG_RAX: ff10
> RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001
> RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440
> RBP: c9097e80 R08:  R09: 0008
> R10:  R11:  R12: 0005
> R13: 820b9338 R14: 0005 R15: 820b9320
>  
>  cpuidle_enter+0x17/0x20
>  call_cpuidle+0x23/0x40
>  do_idle+0xfb/0x200
>  cpu_startup_entry+0x71/0x80
>  start_secondary+0x16a/0x210
>  start_cpu+0x14/0x14
> Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f 82 
> 0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 
> 47 2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 
> 
> 
> RAX looks like list poison, and CR2 = 4, which is likely the ->next of a list,
> with a NULL pointer.

Certainly not on 64 bit. that would be 8. And CR2 is irrelevant here
because that's a #GP not a #PF.

> Here's the disassembly of timer.o
> https://codemonkey.org.uk/junk/timer.o.txt
> 
> Sure enough, run_timer_softirq+0x15f (1ccf) is..
> 
> 
>   entry->next = LIST_POISON2;
> 1cc8: 48 b8 00 02 00 00 00movabs $0xdead0200,%rax

The code above says: <48> 89 50 08 which is

  mov %rdx, 0x8(%rax).

and that is:

   next->pprev = pprev

inside of detach_timer().

So this is actually inside

__hlist_del(struct hlist_node *n)
{
   struct hlist_node *next = n->next;
   struct hlist_node *pprev = n->pprev;

   WRITE_ONCE(*pprev, next);
   if (next)
  next->pprev = pprev;
}

The timer which expires has timer->entry.next == POISON2 !

This has nothing to do with tracing, it's a classic list corruption.  The
bad news is that there is no trace of the culprit because that happens when
some other timer expires after some random amount of time.

If that is reproducible, then please enable debugobjects. That should
pinpoint the culprit.

Thanks,

tglx


Re: run_timer_softirq gpf. tracing?

2017-03-21 Thread Thomas Gleixner
On Tue, 21 Mar 2017, Dave Jones wrote:
> I just hit this while fuzzing..
> 
> general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 
> task: 88017f0ed440 task.stack: c9094000
> RIP: 0010:run_timer_softirq+0x15f/0x700
> RSP: 0018:880507c03ec8 EFLAGS: 00010086
> RAX: dead0200 RBX: 880507dd0d00 RCX: 0002
> RDX: 880507c03ed0 RSI:  RDI: 8204b3a0
> RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8
> R10: 880507dd0db0 R11:  R12: 8215cc38
> R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8
> FS:  () GS:880507c0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0004 CR3: 05011000 CR4: 001406e0
> Call Trace:
>  
>  ? clockevents_program_event+0x47/0x120
>  __do_softirq+0xbf/0x5b1
>  irq_exit+0xb5/0xc0
>  smp_apic_timer_interrupt+0x3d/0x50
>  apic_timer_interrupt+0x97/0xa0
> RIP: 0010:cpuidle_enter_state+0x12e/0x400
> RSP: 0018:c9097e40 EFLAGS: 0202
> [CONT START]  ORIG_RAX: ff10
> RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001
> RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440
> RBP: c9097e80 R08:  R09: 0008
> R10:  R11:  R12: 0005
> R13: 820b9338 R14: 0005 R15: 820b9320
>  
>  cpuidle_enter+0x17/0x20
>  call_cpuidle+0x23/0x40
>  do_idle+0xfb/0x200
>  cpu_startup_entry+0x71/0x80
>  start_secondary+0x16a/0x210
>  start_cpu+0x14/0x14
> Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f 82 
> 0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 
> 47 2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 
> 
> 
> RAX looks like list poison, and CR2 = 4, which is likely the ->next of a list,
> with a NULL pointer.

Certainly not on 64 bit. that would be 8. And CR2 is irrelevant here
because that's a #GP not a #PF.

> Here's the disassembly of timer.o
> https://codemonkey.org.uk/junk/timer.o.txt
> 
> Sure enough, run_timer_softirq+0x15f (1ccf) is..
> 
> 
>   entry->next = LIST_POISON2;
> 1cc8: 48 b8 00 02 00 00 00movabs $0xdead0200,%rax

The code above says: <48> 89 50 08 which is

  mov %rdx, 0x8(%rax).

and that is:

   next->pprev = pprev

inside of detach_timer().

So this is actually inside

__hlist_del(struct hlist_node *n)
{
   struct hlist_node *next = n->next;
   struct hlist_node *pprev = n->pprev;

   WRITE_ONCE(*pprev, next);
   if (next)
  next->pprev = pprev;
}

The timer which expires has timer->entry.next == POISON2 !

This has nothing to do with tracing, it's a classic list corruption.  The
bad news is that there is no trace of the culprit because that happens when
some other timer expires after some random amount of time.

If that is reproducible, then please enable debugobjects. That should
pinpoint the culprit.

Thanks,

tglx


Re: run_timer_softirq gpf. tracing?

2017-03-21 Thread Dave Jones
On Tue, Mar 21, 2017 at 08:25:39PM +0100, Thomas Gleixner wrote:

 > > RAX looks like list poison, and CR2 = 4, which is likely the ->next of a 
 > > list,
 > > with a NULL pointer.
 > 
 > Certainly not on 64 bit. that would be 8. And CR2 is irrelevant here
 > because that's a #GP not a #PF.

doh!

 > The timer which expires has timer->entry.next == POISON2 !
 > 
 > This has nothing to do with tracing, it's a classic list corruption.  The
 > bad news is that there is no trace of the culprit because that happens when
 > some other timer expires after some random amount of time.

ah! Thanks for putting me back on the right path.

 > If that is reproducible, then please enable debugobjects. That should
 > pinpoint the culprit.

hit it twice today so far, so hopefully it'll reproduce.

thanks.

Dave


Re: run_timer_softirq gpf. tracing?

2017-03-21 Thread Dave Jones
On Tue, Mar 21, 2017 at 08:25:39PM +0100, Thomas Gleixner wrote:

 > > RAX looks like list poison, and CR2 = 4, which is likely the ->next of a 
 > > list,
 > > with a NULL pointer.
 > 
 > Certainly not on 64 bit. that would be 8. And CR2 is irrelevant here
 > because that's a #GP not a #PF.

doh!

 > The timer which expires has timer->entry.next == POISON2 !
 > 
 > This has nothing to do with tracing, it's a classic list corruption.  The
 > bad news is that there is no trace of the culprit because that happens when
 > some other timer expires after some random amount of time.

ah! Thanks for putting me back on the right path.

 > If that is reproducible, then please enable debugobjects. That should
 > pinpoint the culprit.

hit it twice today so far, so hopefully it'll reproduce.

thanks.

Dave


run_timer_softirq gpf. tracing?

2017-03-21 Thread Dave Jones
I just hit this while fuzzing..

general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 
task: 88017f0ed440 task.stack: c9094000
RIP: 0010:run_timer_softirq+0x15f/0x700
RSP: 0018:880507c03ec8 EFLAGS: 00010086
RAX: dead0200 RBX: 880507dd0d00 RCX: 0002
RDX: 880507c03ed0 RSI:  RDI: 8204b3a0
RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8
R10: 880507dd0db0 R11:  R12: 8215cc38
R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8
FS:  () GS:880507c0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0004 CR3: 05011000 CR4: 001406e0
Call Trace:
 
 ? clockevents_program_event+0x47/0x120
 __do_softirq+0xbf/0x5b1
 irq_exit+0xb5/0xc0
 smp_apic_timer_interrupt+0x3d/0x50
 apic_timer_interrupt+0x97/0xa0
RIP: 0010:cpuidle_enter_state+0x12e/0x400
RSP: 0018:c9097e40 EFLAGS: 0202
[CONT START]  ORIG_RAX: ff10
RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001
RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440
RBP: c9097e80 R08:  R09: 0008
R10:  R11:  R12: 0005
R13: 820b9338 R14: 0005 R15: 820b9320
 
 cpuidle_enter+0x17/0x20
 call_cpuidle+0x23/0x40
 do_idle+0xfb/0x200
 cpu_startup_entry+0x71/0x80
 start_secondary+0x16a/0x210
 start_cpu+0x14/0x14
Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f 82 
0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 47 
2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 


RAX looks like list poison, and CR2 = 4, which is likely the ->next of a list,
with a NULL pointer.

Here's the disassembly of timer.o
https://codemonkey.org.uk/junk/timer.o.txt

Sure enough, run_timer_softirq+0x15f (1ccf) is..


entry->next = LIST_POISON2;
1cc8:   48 b8 00 02 00 00 00movabs $0xdead0200,%rax
1ccf:   00 ad de 


This looks to be something to do with the tracing definition of timer_cancel.
I don't fully grok how the tracing macros work, is there something obvious
here before I dig deeper ?

Right before that list op, we do this..

if (timer->flags & TIMER_IRQSAFE) {

Which looks like a line from expire_timers.   I'm not following how we get there
from that tracing definition.

No tracing was active when this happened.

Dave



run_timer_softirq gpf. tracing?

2017-03-21 Thread Dave Jones
I just hit this while fuzzing..

general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 
task: 88017f0ed440 task.stack: c9094000
RIP: 0010:run_timer_softirq+0x15f/0x700
RSP: 0018:880507c03ec8 EFLAGS: 00010086
RAX: dead0200 RBX: 880507dd0d00 RCX: 0002
RDX: 880507c03ed0 RSI:  RDI: 8204b3a0
RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8
R10: 880507dd0db0 R11:  R12: 8215cc38
R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8
FS:  () GS:880507c0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0004 CR3: 05011000 CR4: 001406e0
Call Trace:
 
 ? clockevents_program_event+0x47/0x120
 __do_softirq+0xbf/0x5b1
 irq_exit+0xb5/0xc0
 smp_apic_timer_interrupt+0x3d/0x50
 apic_timer_interrupt+0x97/0xa0
RIP: 0010:cpuidle_enter_state+0x12e/0x400
RSP: 0018:c9097e40 EFLAGS: 0202
[CONT START]  ORIG_RAX: ff10
RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001
RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440
RBP: c9097e80 R08:  R09: 0008
R10:  R11:  R12: 0005
R13: 820b9338 R14: 0005 R15: 820b9320
 
 cpuidle_enter+0x17/0x20
 call_cpuidle+0x23/0x40
 do_idle+0xfb/0x200
 cpu_startup_entry+0x71/0x80
 start_secondary+0x16a/0x210
 start_cpu+0x14/0x14
Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 0f 82 
0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 41 f6 47 
2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 


RAX looks like list poison, and CR2 = 4, which is likely the ->next of a list,
with a NULL pointer.

Here's the disassembly of timer.o
https://codemonkey.org.uk/junk/timer.o.txt

Sure enough, run_timer_softirq+0x15f (1ccf) is..


entry->next = LIST_POISON2;
1cc8:   48 b8 00 02 00 00 00movabs $0xdead0200,%rax
1ccf:   00 ad de 


This looks to be something to do with the tracing definition of timer_cancel.
I don't fully grok how the tracing macros work, is there something obvious
here before I dig deeper ?

Right before that list op, we do this..

if (timer->flags & TIMER_IRQSAFE) {

Which looks like a line from expire_timers.   I'm not following how we get there
from that tracing definition.

No tracing was active when this happened.

Dave