Re: general protection fault in dst_destroy() - 4.13.9

2017-11-26 Thread Anders K. Pedersen | Cohaesio
On Mon, 2017-11-20 at 17:13 +0200, Ido Schimmel wrote:
> On Sun, Nov 19, 2017 at 12:45:41PM +, Anders K. Pedersen | Cohaesio wrote:
> > Hello,
> > 
> > A few days ago, one of our routers (running Linux 4.13.9) crashed due
> > to a general protection fault in dst_destroy(). At the time, it had run
> > for several weeks without any problems, but then crashed three times in
> > a row within a few minutes - all due to a general protection fault at
> > dst_destroy()+0x35. Since then, it has run for several days without any
> > further problems, so I suspect that this was triggered by a traffic
> > pattern in the routed packets, but I don't have a way to reproduce it.
> > 
> > Disassembly shows that this is in the inlined dev_put(), which does
> > this_cpu_dec(*dev->pcpu_refcnt). As far as I can tell there haven't
> > been any fixes in this area since 4.13, and a Google search didn't find
> > anything recent, so I'm guessing this is not a known problem.
> > 
> > I have included the kernel output via serial console below as well as
> > gdb and objdump information. Please let me know, if I can provide any
> > additional information.
> > 
> > 
> > [2024260.461401] general protection fault:  [#1] SMP
> > [2024260.467193] Modules linked in:
> > [2024260.470897] CPU: 15 PID: 0 Comm: swapper/15 Tainted: GW   4.13.9 #2
> > [2024260.479488] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.5.5 08/16/2017
> > [2024260.488279] task: 88085b625cc0 task.stack: c90e4000
> > [2024260.495277] RIP: 0010:dst_destroy+0x35/0xa0
> > [2024260.500277] RSP: 0018:88085f5c3f08 EFLAGS: 00010286
> > [2024260.506474] RAX: 88085ac0e880 RBX: 88082cf9fb00 RCX: 0020
> > [2024260.514868] RDX: 88082cf9fbc0 RSI:  RDI: 816786c0
> > [2024260.523258] RBP:  R08: ff00 R09:
> > [2024260.531649] R10:  R11:  R12: 88085f5da678
> > [2024260.540040] R13: 000a R14: 88085b625cc0 R15: 88085b625cc0
> > [2024260.548431] FS:  () GS:88085f5c() knlGS:
> > [2024260.557924] CS:  0010 DS:  ES:  CR0: 80050033
> > [2024260.564719] CR2: 7fc800e48e88 CR3: 01809000 CR4: 001406e0
> > [2024260.573112] Call Trace:
> > [2024260.576113]  
> > [2024260.578618]  ? rcu_process_callbacks+0x18f/0x460
> > [2024260.584126]  ? rebalance_domains+0xe2/0x290
> > [2024260.589128]  ? __do_softirq+0x100/0x292
> > [2024260.593727]  ? irq_exit+0x92/0xa0
> > [2024260.597729]  ? smp_apic_timer_interrupt+0x39/0x50
> > [2024260.603328]  ? apic_timer_interrupt+0x7c/0x90
> > [2024260.608528]  
> > [2024260.611134]  ? cpuidle_enter_state+0x14c/0x2b0
> > [2024260.616432]  ? cpuidle_enter_state+0x128/0x2b0
> > [2024260.621731]  ? do_idle+0xf9/0x190
> > [2024260.625733]  ? cpu_startup_entry+0x5f/0x70
> > [2024260.630636]  ? start_secondary+0x12a/0x130
> > [2024260.635536]  ? secondary_startup_64+0x9f/0x9f
> > [2024260.640731] Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 48 85 c0 74 05 48 89 df ff d0 48 8b 03 48 85 c0 74 0a 48 8b 80 e0 03 00 00 <65> ff 08 f6 43 60 80 74 26 48 8d bb e0 00 00 00 e8 e6 7f 01 00
> > [2024260.662626] RIP: dst_destroy+0x35/0xa0 RSP: 88085f5c3f08
> > [2024260.669333] ---[ end trace 3c1827251806827c ]---
> > [2024260.724173] Kernel panic - not syncing: Fatal exception in interrupt
> > [2024261.102792] Kernel Offset: disabled
> > [2024261.156022] Rebooting in 60 seconds..
> > [2024321.167958] ACPI MEMORY or I/O RESET_REG.
> 
> This looks very similar to a bug Eric already fixed here:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=222d7dbd258dad4cd5241c43ef818141fad5a87a
> 
> I don't see it in v4.13.9 which might explain why you're still hitting
> it. Can you please try to reproduce with mentioned patch?

Yes, it looks like it could be related. I see that it is included in
v4.14, so we'll update to that and see if it comes back.

Thanks,
Anders

Re: general protection fault in dst_destroy() - 4.13.9

2017-11-20 Thread Ido Schimmel
On Sun, Nov 19, 2017 at 12:45:41PM +, Anders K. Pedersen | Cohaesio wrote:
> Hello,
> 
> A few days ago, one of our routers (running Linux 4.13.9) crashed due
> to a general protection fault in dst_destroy(). At the time, it had run
> for several weeks without any problems, but then crashed three times in
> a row within a few minutes - all due to a general protection fault at
> dst_destroy()+0x35. Since then, it has run for several days without any
> further problems, so I suspect that this was triggered by a traffic
> pattern in the routed packets, but I don't have a way to reproduce it.
> 
> Disassembly shows that this is in the inlined dev_put(), which does
> this_cpu_dec(*dev->pcpu_refcnt). As far as I can tell there haven't
> been any fixes in this area since 4.13, and a Google search didn't find
> anything recent, so I'm guessing this is not a known problem.
> 
> I have included the kernel output via serial console below as well as
> gdb and objdump information. Please let me know, if I can provide any
> additional information.
> 
> 
> [2024260.461401] general protection fault:  [#1] SMP
> [2024260.467193] Modules linked in:
> [2024260.470897] CPU: 15 PID: 0 Comm: swapper/15 Tainted: GW   4.13.9 #2
> [2024260.479488] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.5.5 08/16/2017
> [2024260.488279] task: 88085b625cc0 task.stack: c90e4000
> [2024260.495277] RIP: 0010:dst_destroy+0x35/0xa0
> [2024260.500277] RSP: 0018:88085f5c3f08 EFLAGS: 00010286
> [2024260.506474] RAX: 88085ac0e880 RBX: 88082cf9fb00 RCX: 0020
> [2024260.514868] RDX: 88082cf9fbc0 RSI:  RDI: 816786c0
> [2024260.523258] RBP:  R08: ff00 R09:
> [2024260.531649] R10:  R11:  R12: 88085f5da678
> [2024260.540040] R13: 000a R14: 88085b625cc0 R15: 88085b625cc0
> [2024260.548431] FS:  () GS:88085f5c() knlGS:
> [2024260.557924] CS:  0010 DS:  ES:  CR0: 80050033
> [2024260.564719] CR2: 7fc800e48e88 CR3: 01809000 CR4: 001406e0
> [2024260.573112] Call Trace:
> [2024260.576113]  
> [2024260.578618]  ? rcu_process_callbacks+0x18f/0x460
> [2024260.584126]  ? rebalance_domains+0xe2/0x290
> [2024260.589128]  ? __do_softirq+0x100/0x292
> [2024260.593727]  ? irq_exit+0x92/0xa0
> [2024260.597729]  ? smp_apic_timer_interrupt+0x39/0x50
> [2024260.603328]  ? apic_timer_interrupt+0x7c/0x90
> [2024260.608528]  
> [2024260.611134]  ? cpuidle_enter_state+0x14c/0x2b0
> [2024260.616432]  ? cpuidle_enter_state+0x128/0x2b0
> [2024260.621731]  ? do_idle+0xf9/0x190
> [2024260.625733]  ? cpu_startup_entry+0x5f/0x70
> [2024260.630636]  ? start_secondary+0x12a/0x130
> [2024260.635536]  ? secondary_startup_64+0x9f/0x9f
> [2024260.640731] Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 48 85 c0 74 05 48 89 df ff d0 48 8b 03 48 85 c0 74 0a 48 8b 80 e0 03 00 00 <65> ff 08 f6 43 60 80 74 26 48 8d bb e0 00 00 00 e8 e6 7f 01 00
> [2024260.662626] RIP: dst_destroy+0x35/0xa0 RSP: 88085f5c3f08
> [2024260.669333] ---[ end trace 3c1827251806827c ]---
> [2024260.724173] Kernel panic - not syncing: Fatal exception in interrupt
> [2024261.102792] Kernel Offset: disabled
> [2024261.156022] Rebooting in 60 seconds..
> [2024321.167958] ACPI MEMORY or I/O RESET_REG.

This looks very similar to a bug Eric already fixed here:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=222d7dbd258dad4cd5241c43ef818141fad5a87a

I don't see it in v4.13.9 which might explain why you're still hitting
it. Can you please try to reproduce with mentioned patch?

Thanks


general protection fault in dst_destroy() - 4.13.9

2017-11-19 Thread Anders K. Pedersen | Cohaesio
Hello,

A few days ago, one of our routers (running Linux 4.13.9) crashed due
to a general protection fault in dst_destroy(). At the time, it had run
for several weeks without any problems, but then crashed three times in
a row within a few minutes - all due to a general protection fault at
dst_destroy()+0x35. Since then, it has run for several days without any
further problems, so I suspect that this was triggered by a traffic
pattern in the routed packets, but I don't have a way to reproduce it.

Disassembly shows that this is in the inlined dev_put(), which does
this_cpu_dec(*dev->pcpu_refcnt). As far as I can tell there haven't
been any fixes in this area since 4.13, and a Google search didn't find
anything recent, so I'm guessing this is not a known problem.

I have included the kernel output via serial console below as well as
gdb and objdump information. Please let me know, if I can provide any
additional information.


[2024260.461401] general protection fault:  [#1] SMP
[2024260.467193] Modules linked in:
[2024260.470897] CPU: 15 PID: 0 Comm: swapper/15 Tainted: GW   4.13.9 #2
[2024260.479488] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.5.5 08/16/2017
[2024260.488279] task: 88085b625cc0 task.stack: c90e4000
[2024260.495277] RIP: 0010:dst_destroy+0x35/0xa0
[2024260.500277] RSP: 0018:88085f5c3f08 EFLAGS: 00010286
[2024260.506474] RAX: 88085ac0e880 RBX: 88082cf9fb00 RCX: 0020
[2024260.514868] RDX: 88082cf9fbc0 RSI:  RDI: 816786c0
[2024260.523258] RBP:  R08: ff00 R09:
[2024260.531649] R10:  R11:  R12: 88085f5da678
[2024260.540040] R13: 000a R14: 88085b625cc0 R15: 88085b625cc0
[2024260.548431] FS:  () GS:88085f5c() knlGS:
[2024260.557924] CS:  0010 DS:  ES:  CR0: 80050033
[2024260.564719] CR2: 7fc800e48e88 CR3: 01809000 CR4: 001406e0
[2024260.573112] Call Trace:
[2024260.576113]  
[2024260.578618]  ? rcu_process_callbacks+0x18f/0x460
[2024260.584126]  ? rebalance_domains+0xe2/0x290
[2024260.589128]  ? __do_softirq+0x100/0x292
[2024260.593727]  ? irq_exit+0x92/0xa0
[2024260.597729]  ? smp_apic_timer_interrupt+0x39/0x50
[2024260.603328]  ? apic_timer_interrupt+0x7c/0x90
[2024260.608528]  
[2024260.611134]  ? cpuidle_enter_state+0x14c/0x2b0
[2024260.616432]  ? cpuidle_enter_state+0x128/0x2b0
[2024260.621731]  ? do_idle+0xf9/0x190
[2024260.625733]  ? cpu_startup_entry+0x5f/0x70
[2024260.630636]  ? start_secondary+0x12a/0x130
[2024260.635536]  ? secondary_startup_64+0x9f/0x9f
[2024260.640731] Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 48 85 c0 74 05 48 89 df ff d0 48 8b 03 48 85 c0 74 0a 48 8b 80 e0 03 00 00 <65> ff 08 f6 43 60 80 74 26 48 8d bb e0 00 00 00 e8 e6 7f 01 00
[2024260.662626] RIP: dst_destroy+0x35/0xa0 RSP: 88085f5c3f08
[2024260.669333] ---[ end trace 3c1827251806827c ]---
[2024260.724173] Kernel panic - not syncing: Fatal exception in interrupt
[2024261.102792] Kernel Offset: disabled
[2024261.156022] Rebooting in 60 seconds..
[2024321.167958] ACPI MEMORY or I/O RESET_REG.
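
As a rough cross-check of the disassembly claim, the Code: line of the oops
above can be split at the byte the kernel marks with <..>, which is the first
byte of the faulting instruction. The Python sketch below does that. The
helper name split_at_fault is made up for illustration, and the byte string is
copied from the trace with the mail line wrapping removed. The three bytes at
RIP, 65 ff 08, encode a GS-prefixed dec through RAX, and the seven bytes before
them, 48 8b 80 e0 03 00 00, load RAX from offset 0x3e0 of the old RAX. That is
consistent with this_cpu_dec(*dev->pcpu_refcnt), provided pcpu_refcnt sits at
offset 0x3e0 of struct net_device in this build (an assumption, not verified
here).

```python
# Code: bytes copied from the first oops above (mail line wrapping removed).
CODE_LINE = (
    "f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 48 85 c0 74 05 48 "
    "89 df ff d0 48 8b 03 48 85 c0 74 0a 48 8b 80 e0 03 00 00 <65> ff 08 "
    "f6 43 60 80 74 26 48 8d bb e0 00 00 00 e8 e6 7f 01 00"
)


def split_at_fault(code_line):
    """Hypothetical helper: split an oops Code: dump at the <xx> marker.

    Returns (bytes before RIP, bytes starting at RIP).
    """
    tokens = code_line.split()
    marked = [i for i, tok in enumerate(tokens) if tok.startswith("<")]
    if len(marked) != 1:
        raise ValueError("expected exactly one <xx> marker in the Code: line")
    raw = bytes(int(tok.strip("<>"), 16) for tok in tokens)
    idx = marked[0]
    return raw[:idx], raw[idx:]


def hexstr(data):
    """Format bytes the same way the oops prints them."""
    return " ".join(f"{b:02x}" for b in data)


before, at_rip = split_at_fault(CODE_LINE)

# RIP is dst_destroy+0x35, so the dump starts at 0x35 - len(before).
dump_start = 0x35 - len(before)

print("dump starts at dst_destroy+%#x" % dump_start)
print("load before the fault:", hexstr(before[-7:]))
print("faulting instruction :", hexstr(at_rip[:3]))
```

For a full disassembly rather than this byte-level check, the same Code: line
can be fed to scripts/decodecode from the kernel source tree.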


[   36.620034] general protection fault:  [#1] SMP
[   36.625637] Modules linked in:
[   36.629141] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.13.9 #2
[   36.635938] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.5.5 08/16/2017
[   36.644532] task: 88085b46a7c0 task.stack: c907c000
[   36.651333] RIP: 0010:dst_destroy+0x35/0xa0
[   36.656133] RSP: 0018:88085f283f08 EFLAGS: 00010286
[   36.662133] RAX: 2e37307830203a65 RBX: 88082ac1 RCX: 0020
[   36.670326] RDX: 88082ac100c0 RSI:  RDI: 816786c0
[   36.678521] RBP:  R08: 30e3e201 R09: 00010080007a
[   36.686714] R10: 88085f283e20 R11: ea0020c38e00 R12: 88085f29a678
[   36.694906] R13: 000a R14: 88085b46a7c0 R15: 88085b46a7c0
[   36.703102] FS:  () GS:88085f28() knlGS:
[   36.712395] CS:  0010 DS:  ES:  CR0: 80050033
[   36.718992] CR2: 55568c725558 CR3: 01809000 CR4: 001406e0
[   36.727184] Call Trace:
[   36.729987]  
[   36.732287]  ? rcu_process_callbacks+0x18f/0x460
[   36.737588]  ? rebalance_domains+0xe2/0x290
[   36.742388]  ? __do_softirq+0x100/0x292
[   36.746790]  ? irq_exit+0x92/0xa0
[   36.750590]  ? smp_apic_timer_interrupt+0x39/0x50
[   36.755990]  ? apic_timer_interrupt+0x7c/0x90
[   36.760987]  
[   36.763392]  ? poll_idle+0x46/0x7a
[   36.767295]  ? cpuidle_enter_state+0x102/0x2b0
[   36.772396]  ? do_idle+0xf9/0x190
[   36.776197]  ? cpu_startup_entry+0x5f/0x70
[   36.780892]  ? start_secondary+0x12a/0x130
[   36.785592]  ? secondary_startup_64+0x9f/0x9f
[   36.790590] Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 48 
85