Re: general protection fault in dst_destroy() - 4.13.9
On Mon, 2017-11-20 at 17:13 +0200, Ido Schimmel wrote:
> On Sun, Nov 19, 2017 at 12:45:41PM +, Anders K. Pedersen | Cohaesio wrote:
> > A few days ago, one of our routers (running Linux 4.13.9) crashed due
> > to a general protection fault in dst_destroy().
> > [...]
>
> This looks very similar to a bug Eric already fixed here:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=222d7dbd258dad4cd5241c43ef818141fad5a87a
>
> I don't see it in v4.13.9 which might explain why you're still hitting
> it. Can you please try to reproduce with the mentioned patch?

Yes, it looks like it could be related. I see that it is included in
v4.14, so we'll update to that and see if it comes back.

Thanks,
Anders
Re: general protection fault in dst_destroy() - 4.13.9
On Sun, Nov 19, 2017 at 12:45:41PM +, Anders K. Pedersen | Cohaesio wrote:
> Hello,
>
> A few days ago, one of our routers (running Linux 4.13.9) crashed due
> to a general protection fault in dst_destroy(). At the time, it had run
> for several weeks without any problems, but then crashed three times in
> a row within a few minutes - all due to a general protection fault at
> dst_destroy()+0x35.
> [...]

This looks very similar to a bug Eric already fixed here:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=222d7dbd258dad4cd5241c43ef818141fad5a87a

I don't see it in v4.13.9 which might explain why you're still hitting
it. Can you please try to reproduce with the mentioned patch?

Thanks
general protection fault in dst_destroy() - 4.13.9
Hello,

A few days ago, one of our routers (running Linux 4.13.9) crashed due
to a general protection fault in dst_destroy(). At the time, it had run
for several weeks without any problems, but then crashed three times in
a row within a few minutes - all due to a general protection fault at
dst_destroy()+0x35. Since then, it has run for several days without any
further problems, so I suspect that this was triggered by a traffic
pattern in the routed packets, but I don't have a way to reproduce it.

Disassembly shows that this is in the inlined dev_put(), which does
this_cpu_dec(*dev->pcpu_refcnt). As far as I can tell there haven't
been any fixes in this area since 4.13, and a Google search didn't find
anything recent, so I'm guessing this is not a known problem.

I have included the kernel output via serial console below as well as
gdb and objdump information. Please let me know, if I can provide any
additional information.

[2024260.461401] general protection fault: [#1] SMP
[2024260.467193] Modules linked in:
[2024260.470897] CPU: 15 PID: 0 Comm: swapper/15 Tainted: GW 4.13.9 #2
[2024260.479488] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.5.5 08/16/2017
[2024260.488279] task: 88085b625cc0 task.stack: c90e4000
[2024260.495277] RIP: 0010:dst_destroy+0x35/0xa0
[2024260.500277] RSP: 0018:88085f5c3f08 EFLAGS: 00010286
[2024260.506474] RAX: 88085ac0e880 RBX: 88082cf9fb00 RCX: 0020
[2024260.514868] RDX: 88082cf9fbc0 RSI: RDI: 816786c0
[2024260.523258] RBP: R08: ff00 R09:
[2024260.531649] R10: R11: R12: 88085f5da678
[2024260.540040] R13: 000a R14: 88085b625cc0 R15: 88085b625cc0
[2024260.548431] FS: () GS:88085f5c() knlGS:
[2024260.557924] CS: 0010 DS: ES: CR0: 80050033
[2024260.564719] CR2: 7fc800e48e88 CR3: 01809000 CR4: 001406e0
[2024260.573112] Call Trace:
[2024260.576113]
[2024260.578618] ? rcu_process_callbacks+0x18f/0x460
[2024260.584126] ? rebalance_domains+0xe2/0x290
[2024260.589128] ? __do_softirq+0x100/0x292
[2024260.593727] ? irq_exit+0x92/0xa0
[2024260.597729] ? smp_apic_timer_interrupt+0x39/0x50
[2024260.603328] ? apic_timer_interrupt+0x7c/0x90
[2024260.608528]
[2024260.611134] ? cpuidle_enter_state+0x14c/0x2b0
[2024260.616432] ? cpuidle_enter_state+0x128/0x2b0
[2024260.621731] ? do_idle+0xf9/0x190
[2024260.625733] ? cpu_startup_entry+0x5f/0x70
[2024260.630636] ? start_secondary+0x12a/0x130
[2024260.635536] ? secondary_startup_64+0x9f/0x9f
[2024260.640731] Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 48 85 c0 74 05 48 89 df ff d0 48 8b 03 48 85 c0 74 0a 48 8b 80 e0 03 00 00 <65> ff 08 f6 43 60 80 74 26 48 8d bb e0 00 00 00 e8 e6 7f 01 00
[2024260.662626] RIP: dst_destroy+0x35/0xa0 RSP: 88085f5c3f08
[2024260.669333] ---[ end trace 3c1827251806827c ]---
[2024260.724173] Kernel panic - not syncing: Fatal exception in interrupt
[2024261.102792] Kernel Offset: disabled
[2024261.156022] Rebooting in 60 seconds..
[2024321.167958] ACPI MEMORY or I/O RESET_REG.

[ 36.620034] general protection fault: [#1] SMP
[ 36.625637] Modules linked in:
[ 36.629141] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.13.9 #2
[ 36.635938] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 2.5.5 08/16/2017
[ 36.644532] task: 88085b46a7c0 task.stack: c907c000
[ 36.651333] RIP: 0010:dst_destroy+0x35/0xa0
[ 36.656133] RSP: 0018:88085f283f08 EFLAGS: 00010286
[ 36.662133] RAX: 2e37307830203a65 RBX: 88082ac1 RCX: 0020
[ 36.670326] RDX: 88082ac100c0 RSI: RDI: 816786c0
[ 36.678521] RBP: R08: 30e3e201 R09: 00010080007a
[ 36.686714] R10: 88085f283e20 R11: ea0020c38e00 R12: 88085f29a678
[ 36.694906] R13: 000a R14: 88085b46a7c0 R15: 88085b46a7c0
[ 36.703102] FS: () GS:88085f28() knlGS:
[ 36.712395] CS: 0010 DS: ES: CR0: 80050033
[ 36.718992] CR2: 55568c725558 CR3: 01809000 CR4: 001406e0
[ 36.727184] Call Trace:
[ 36.729987]
[ 36.732287] ? rcu_process_callbacks+0x18f/0x460
[ 36.737588] ? rebalance_domains+0xe2/0x290
[ 36.742388] ? __do_softirq+0x100/0x292
[ 36.746790] ? irq_exit+0x92/0xa0
[ 36.750590] ? smp_apic_timer_interrupt+0x39/0x50
[ 36.755990] ? apic_timer_interrupt+0x7c/0x90
[ 36.760987]
[ 36.763392] ? poll_idle+0x46/0x7a
[ 36.767295] ? cpuidle_enter_state+0x102/0x2b0
[ 36.772396] ? do_idle+0xf9/0x190
[ 36.776197] ? cpu_startup_entry+0x5f/0x70
[ 36.780892] ? start_secondary+0x12a/0x130
[ 36.785592] ? secondary_startup_64+0x9f/0x9f
[ 36.790590] Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 48 85
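
The report says the fault is in the inlined dev_put(), i.e. the this_cpu_dec(). As a rough cross-check, the Code: line of the first trace can be split at the <..> marker, which is the byte RIP pointed at when the fault hit. The following is only a minimal Python sketch: the helper name split_code_line is made up for illustration, and the byte string is copied from the first trace above. The kernel tree's scripts/decodecode does the complete job by feeding the bytes to objdump.

```python
import re

# Code: line copied from the first oops in the report.
CODE_LINE = (
    "Code: f6 47 60 08 48 8b 6f 18 74 62 48 8b 43 20 48 8b 40 30 "
    "48 85 c0 74 05 48 89 df ff d0 48 8b 03 48 85 c0 74 0a "
    "48 8b 80 e0 03 00 00 <65> ff 08 f6 43 60 80 74 26 48 8d bb "
    "e0 00 00 00 e8 e6 7f 01 00"
)

# RIP offset from the trace: dst_destroy+0x35
RIP_OFFSET = 0x35


def split_code_line(line):
    """Hypothetical helper: return (bytes before the marker,
    bytes from the marked byte onwards) for an oops Code: line."""
    tokens = line.split(":", 1)[1].split()
    before, at_and_after = [], []
    seen_marker = False
    for tok in tokens:
        marked = re.fullmatch(r"<([0-9a-fA-F]{2})>", tok)
        if marked:
            seen_marker = True
            at_and_after.append(int(marked.group(1), 16))
        elif seen_marker:
            at_and_after.append(int(tok, 16))
        else:
            before.append(int(tok, 16))
    if not seen_marker:
        raise ValueError("no <xx> marker found in Code: line")
    return before, at_and_after


before, at_and_after = split_code_line(CODE_LINE)

# Function offset of the first dumped byte.
dump_start = RIP_OFFSET - len(before)

print("bytes before faulting instruction:", len(before))
print("dump starts at dst_destroy+0x%x" % dump_start)
print("bytes at RIP:", " ".join("%02x" % b for b in at_and_after[:3]))
```

The three bytes at the marker, 65 ff 08, encode a GS-prefixed 32-bit decrement through RAX (decl %gs:(%rax) in AT&T syntax), which is consistent with the this_cpu_dec() mentioned in the report. The preceding 48 8b 80 e0 03 00 00 loads RAX from offset 0x3e0 of the structure whose pointer was read just before it.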