Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101
On Mon, Dec 19, 2016 at 01:09:39PM -0500, Lennart Sorensen wrote:
> I am trying to debug a problem that has been happening occasionally for
> years on some of our systems running 3.0.101 kernel (yes I know it is
> old, we are moving to 4.9 at the moment but I would like older releases
> to be fixed too, assuming 4.9 makes this problem disappear).
>
> What is happening is that once in a while a process does something wrong
> and segfaults, and dumps core. We have a handler to process the core dump
> to name it and compress it and make sure we don't keep too many around,
> so the core_pattern uses the pipe option to pipe the dump to a shell
> script that saves it with the pid and current timestamp and gzips it.
>
> Once in a while when this happens, the kernel hits a null pointer
> dereference in fpu.state->xsave while doing __switch_to.
>
> The system is x86_64 with dual E5-2620 CPUs (6 cores each with
> hyperthreading). Some people think they have seen it on other systems,
> but are not sure. I have not been able to trigger it on other systems
> yet.
>
> It used to take about a week of running tests to trigger it, but I have
> now managed to hit it in a few minutes pretty reliably.

If the core_pattern is not set to use a pipe, and instead just saves the
dump as core.%e.%p, then the problem does not happen.

-- 
Len Sorensen
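Since the pipe setting is what separates the failing case from the working one, it may be worth checking which mode a test box is actually in before each run. The following is only a sketch: the function names are made up for illustration, and it relies on the rule documented in core(5) that a core_pattern whose first character is '|' is piped to a program.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if a core_pattern string selects a
 * pipe handler (first character is '|'), 0 otherwise. */
int is_pipe_pattern(const char *pattern)
{
    return pattern != NULL && pattern[0] == '|';
}

/* Hypothetical helper: reads /proc/sys/kernel/core_pattern.
 * Returns 1 for a pipe handler, 0 for a plain file pattern,
 * -1 if the file could not be read. */
int core_pattern_uses_pipe(void)
{
    char buf[256];
    FILE *f = fopen("/proc/sys/kernel/core_pattern", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return is_pipe_pattern(buf);
}
```

Calling core_pattern_uses_pipe() at the start of a test run and logging the result would record whether a given crash (or the absence of one) happened in pipe mode or in core.%e.%p mode.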
Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101
I am trying to debug a problem that has been happening occasionally for
years on some of our systems running 3.0.101 kernel (yes I know it is
old, we are moving to 4.9 at the moment but I would like older releases
to be fixed too, assuming 4.9 makes this problem disappear).

What is happening is that once in a while a process does something wrong
and segfaults, and dumps core. We have a handler to process the core dump
to name it and compress it and make sure we don't keep too many around,
so the core_pattern uses the pipe option to pipe the dump to a shell
script that saves it with the pid and current timestamp and gzips it.

Once in a while when this happens, the kernel hits a null pointer
dereference in fpu.state->xsave while doing __switch_to.

The system is x86_64 with dual E5-2620 CPUs (6 cores each with
hyperthreading). Some people think they have seen it on other systems,
but are not sure. I have not been able to trigger it on other systems
yet.

It used to take about a week of running tests to trigger it, but I have
now managed to hit it in a few minutes pretty reliably.

The way I trigger it is:

abuse.c:

    /* The header names were lost in this copy of the message; the ones
     * below are what the code needs in order to compile. */
    #include <math.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main()
    {
        pid_t pid;

        pid = getpid();
        /* floating-point work before the crash */
        printf("%d: pid / PI = %f\n", (int)pid, pid / M_PI);
        /* force a segfault so the kernel dumps core */
        kill(pid, SIGSEGV);
        return 0;
    }

I then run:

    for i in `seq 1 1 1`; do ./abuse & echo -n; done

That pretty reliably hits the problem.
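For anyone who wants to drive the same kind of test without the shell loop, here is a rough sketch of a C function that forks a number of children, has each one do a little floating-point arithmetic and then kill itself with SIGSEGV, and reaps them all. The name spawn_crashers and the count parameter are made up for illustration. The floating-point step only mirrors what abuse.c does; whether it matters for triggering the bug is not established here.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork `count` children that each die from SIGSEGV,
 * then reap them. Returns how many children were killed by SIGSEGV. */
int spawn_crashers(int count)
{
    int started = 0;
    int segfaulted = 0;

    for (int i = 0; i < count; i++) {
        pid_t pid = fork();

        if (pid < 0)
            break;                      /* stop launching, still reap below */
        if (pid == 0) {
            /* Child: some floating-point work, as in abuse.c. */
            volatile double x = (double)getpid() / 3.14159265358979;
            (void)x;
            signal(SIGSEGV, SIG_DFL);   /* make sure the default action applies */
            kill(getpid(), SIGSEGV);
            _exit(1);                   /* only reached if the signal was not delivered */
        }
        started++;
    }

    /* Parent: reap every child that was started and count the SIGSEGV deaths. */
    for (int i = 0; i < started; i++) {
        int status;

        if (wait(&status) > 0 && WIFSIGNALED(status) &&
            WTERMSIG(status) == SIGSEGV)
            segfaulted++;
    }
    return segfaulted;
}
```

Because all children are started before any is reaped, several core dumps can be in flight at once, which is the situation the backgrounded ./abuse invocations create. Whether a core is actually written or piped for each child still depends on the core size limit and the core_pattern setting on the machine.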
The crash dump we get is:

Dec 19 12:24:03 HWC-64 kernel: BUG: unable to handle kernel NULL pointer dereference at 033f
Dec 19 12:24:03 HWC-64 kernel: IP: [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: PGD 39a6ce067 PUD 39a8cf067 PMD 0
Dec 19 12:24:03 HWC-64 kernel: Oops: 0002 [#1] SMP
Dec 19 12:24:03 HWC-64 kernel: CPU 12
Dec 19 12:24:03 HWC-64 kernel: Modules linked in: dpi_drv(P) ccu_util(P) ipv4_mb(P) l2bridge_config_util(P) l2_config_util(P) route_config_util(P) qos_config_util(P) sysapp_common(P) chantry_fwd_eng_2800_config(P) shim_module(P) sadb_cc(P) ipsecXformer(P) libeCrypto(P) ipmatch_cc(P) l2h_cc(P) ndproxy_cc(P) arpint_cc(P) portinfo_cc(P) chantryqos_cc(P) redirector_cc(P) ix_ph(P) fpm_core_cc(P) pulse_cc(P) vnstt_cc(P) vnsap_cc(P) fm_cc(P) rutm_cc(P) mutm_cc(P) ethernet_tx_cc(P) stkdrv_cc(P) l2bridge_cc(P) events_util(P) sched_cc(P) qm_cc(P) ipv4_cc(P) wred_cc(P) tc_meter_cc(P) dscp_classifier_cc(P) classifier_6t_cc(P) ent586_cc(P) dev_cc_arp(P) chantry_fwd_eng_2800_tables(P) ether_arp_lib(P) rtmv4_lib(P) lkup_lib(P) l2tm_lib(P) fragmentation_lib(P) properties_lib(P) msg_support_lib(P) utilities_lib(P) cci_lib(P) rm_lib(P) libossl vip productSpec_x86_dp(P) ixgbe igb
Dec 19 12:24:03 HWC-64 kernel:
Dec 19 12:24:03 HWC-64 kernel: Pid: 16440, comm: coremgr.sh Tainted: P 3.0.101 #6 Intel Corporation S2600GZ ../S2600GZ
Dec 19 12:24:03 HWC-64 kernel: RIP: 0010:[] [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: RSP: 0018:88042f2c9de8 EFLAGS: 00010002
Dec 19 12:24:03 HWC-64 kernel: RAX: RBX: 8803d8266ae0 RCX: 000c
Dec 19 12:24:03 HWC-64 kernel: RDX: RSI: 88042f29b840 RDI:
Dec 19 12:24:03 HWC-64 kernel: RBP: 88043f38da40 R08: 8803d8266e08 R09: 000100018011
Dec 19 12:24:03 HWC-64 kernel: R10: 4000 R11: 0246 R12:
Dec 19 12:24:03 HWC-64 kernel: R13: 88042ce62080 R14: 000c R15: 88042f29b840
Dec 19 12:24:03 HWC-64 kernel: FS: 7f72f71a5700() GS:88043f38() knlGS:
Dec 19 12:24:03 HWC-64 kernel: CS: 0010 DS: ES: CR0: 80050033
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f CR3: 0003d8c7 CR4: 000406e0
Dec 19 12:24:03 HWC-64 kernel: DR0: DR1: DR2:
Dec 19 12:24:03 HWC-64 kernel: DR3: DR6: 0ff0 DR7: 0400
Dec 19 12:24:03 HWC-64 kernel: Process coremgr.sh (pid: 16440, threadinfo 88039a00c000, task 8803d8266ae0)
Dec 19 12:24:03 HWC-64 kernel: Stack:
Dec 19 12:24:03 HWC-64 kernel: 0018 88043f38fe40 88043f38fe40 88042f29b840
Dec 19 12:24:03 HWC-64 kernel: 81301c94 88042f29b840 0046
Dec 19 12:24:03 HWC-64 kernel: 88042f2c9fd8 88042f2c9fd8 a000
Dec 19 12:24:03 HWC-64 kernel: Call Trace:
Dec 19 12:24:03 HWC-64 kernel: Code: 48 63 c1 48 03 14 c5 40 a0 67 81 8b 87 b8 03 00 00 48 89 d5 85 c0 74 37 66 66 90 66 90 b8 ff ff ff ff 48 8b bf c0 03 00 00 89 c2
Dec 19 12:24:03 HWC-64 kernel: <48> 0f ae 37 48 8b 83 c0 03 00 00 f6 80 00 02 00 00 01 0f 84 88
Dec 19 12:24:03 HWC-64 kernel: RIP [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: RSP
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f
Dec 19 12:24:03 HWC-64 kernel: ---[ end trace 366732e1020fb678 ]---

That is