Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-20 Thread Lennart Sorensen
On Mon, Dec 19, 2016 at 01:09:39PM -0500, Lennart Sorensen wrote:
> I am trying to debug a problem that has been happening occationally for
> years on some of our systems running 3.0.101 kernel (yes I know it is
> old, we are moving to 4.9 at the moment but I would like older releases
> to be fixed too, assuming 4.9 makes this problem disappear).
> 
> What is happening is that once in a while a process does something wrong
> and segfaults, and dumps core.  We have a handler to process the core dump
> to name it and compress it and make sure we don't keep to many around,
> so the core_pattern uses the pipe option to pipe the dump to a shell
> script that saves it with the pid and current timestamp and gzips it.
> 
> Once in a while when this happens, the kernel hits a null pointer
> dereference in fpu.state->xsave while doing __switch_to.
> 
> The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
> hyperthreading).  Some people think they have seen it on other systems,
> but are not sure.  I have not been able to trigger it on other systems
> yet.
> 
> It used to take about a week of running tests to trigger it, but I have
> now managed to hit it in a few minutes pretty reliably.

If the core_pattern is not set to use a pipe, but just save as core.%e.%p
then the problem does not happen.

-- 
Len Sorensen


Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-20 Thread Lennart Sorensen
On Mon, Dec 19, 2016 at 01:09:39PM -0500, Lennart Sorensen wrote:
> I am trying to debug a problem that has been happening occationally for
> years on some of our systems running 3.0.101 kernel (yes I know it is
> old, we are moving to 4.9 at the moment but I would like older releases
> to be fixed too, assuming 4.9 makes this problem disappear).
> 
> What is happening is that once in a while a process does something wrong
> and segfaults, and dumps core.  We have a handler to process the core dump
> to name it and compress it and make sure we don't keep to many around,
> so the core_pattern uses the pipe option to pipe the dump to a shell
> script that saves it with the pid and current timestamp and gzips it.
> 
> Once in a while when this happens, the kernel hits a null pointer
> dereference in fpu.state->xsave while doing __switch_to.
> 
> The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
> hyperthreading).  Some people think they have seen it on other systems,
> but are not sure.  I have not been able to trigger it on other systems
> yet.
> 
> It used to take about a week of running tests to trigger it, but I have
> now managed to hit it in a few minutes pretty reliably.

If the core_pattern is not set to use a pipe, but just save as core.%e.%p
then the problem does not happen.

-- 
Len Sorensen


Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-19 Thread Lennart Sorensen
I am trying to debug a problem that has been happening occationally for
years on some of our systems running 3.0.101 kernel (yes I know it is
old, we are moving to 4.9 at the moment but I would like older releases
to be fixed too, assuming 4.9 makes this problem disappear).

What is happening is that once in a while a process does something wrong
and segfaults, and dumps core.  We have a handler to process the core dump
to name it and compress it and make sure we don't keep to many around,
so the core_pattern uses the pipe option to pipe the dump to a shell
script that saves it with the pid and current timestamp and gzips it.

Once in a while when this happens, the kernel hits a null pointer
dereference in fpu.state->xsave while doing __switch_to.

The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
hyperthreading).  Some people think they have seen it on other systems,
but are not sure.  I have not been able to trigger it on other systems
yet.

It used to take about a week of running tests to trigger it, but I have
now managed to hit it in a few minutes pretty reliably.

The way I trigger it is:

abuse.c:

#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main() {
pid_t pid;

pid = getpid();

printf("%u: pid / PI = %f\n", pid, pid / M_PI);

kill(pid, SIGSEGV);
return 0;
}

I then run:
for i in `seq 1 1 1`; do ./abuse & echo -n; done

That pretty reliably hits the problem.

The crash dump we get is:
Dec 19 12:24:03 HWC-64 kernel: BUG: unable to handle kernel NULL pointer 
dereference at 033f
Dec 19 12:24:03 HWC-64 kernel: IP: [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: PGD 39a6ce067 PUD 39a8cf067 PMD 0
Dec 19 12:24:03 HWC-64 kernel: Oops: 0002 [#1] SMP
Dec 19 12:24:03 HWC-64 kernel: CPU 12
Dec 19 12:24:03 HWC-64 kernel: Modules linked in: dpi_drv(P) ccu_util(P) 
ipv4_mb(P) l2bridge_config_util(P) l2_config_util(P) route_config_util(P) 
qos_config_util(P) sysapp_common(P) chantry_fwd_eng_2800_config(P) 
shim_module(P) sadb_cc(P) ipsecXformer(P) libeCrypto(P) ipmatch_cc(P) l2h_cc(P) 
ndproxy_cc(P) arpint_cc(P) portinfo_cc(P) chantryqos_cc(P) redirector_cc(P) 
ix_ph(P) fpm_core_cc(P) pulse_cc(P) vnstt_cc(P) vnsap_cc(P) fm_cc(P) rutm_cc(P) 
mutm_cc(P) ethernet_tx_cc(P) stkdrv_cc(P) l2bridge_cc(P) events_util(P) 
sched_cc(P) qm_cc(P) ipv4_cc(P) wred_cc(P) tc_meter_cc(P) dscp_classifier_cc(P) 
classifier_6t_cc(P) ent586_cc(P) dev_cc_arp(P) chantry_fwd_eng_2800_tables(P) 
ether_arp_lib(P) rtmv4_lib(P) lkup_lib(P) l2tm_lib(P) fragmentation_lib(P) 
properties_lib(P) msg_support_lib(P) utilities_lib(P) cci_lib(P) rm_lib(P) 
libossl vip productSpec_x86_dp(P) ixgbe igb
Dec 19 12:24:03 HWC-64 kernel:
Dec 19 12:24:03 HWC-64 kernel: Pid: 16440, comm: coremgr.sh Tainted: P  
  3.0.101 #6 Intel Corporation S2600GZ ../S2600GZ
Dec 19 12:24:03 HWC-64 kernel: RIP: 0010:[]  
[] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: RSP: 0018:88042f2c9de8  EFLAGS: 00010002
Dec 19 12:24:03 HWC-64 kernel: RAX:  RBX: 8803d8266ae0 RCX: 
000c
Dec 19 12:24:03 HWC-64 kernel: RDX:  RSI: 88042f29b840 RDI: 

Dec 19 12:24:03 HWC-64 kernel: RBP: 88043f38da40 R08: 8803d8266e08 R09: 
000100018011
Dec 19 12:24:03 HWC-64 kernel: R10: 4000 R11: 0246 R12: 

Dec 19 12:24:03 HWC-64 kernel: R13: 88042ce62080 R14: 000c R15: 
88042f29b840
Dec 19 12:24:03 HWC-64 kernel: FS:  7f72f71a5700() 
GS:88043f38() knlGS:
Dec 19 12:24:03 HWC-64 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f CR3: 0003d8c7 CR4: 
000406e0
Dec 19 12:24:03 HWC-64 kernel: DR0:  DR1:  DR2: 

Dec 19 12:24:03 HWC-64 kernel: DR3:  DR6: 0ff0 DR7: 
0400
Dec 19 12:24:03 HWC-64 kernel: Process coremgr.sh (pid: 16440, threadinfo 
88039a00c000, task 8803d8266ae0)
Dec 19 12:24:03 HWC-64 kernel: Stack:
Dec 19 12:24:03 HWC-64 kernel:  0018 88043f38fe40 
88043f38fe40 88042f29b840
Dec 19 12:24:03 HWC-64 kernel:   81301c94 
88042f29b840 0046
Dec 19 12:24:03 HWC-64 kernel:  88042f2c9fd8 88042f2c9fd8 
a000 
Dec 19 12:24:03 HWC-64 kernel: Call Trace:
Dec 19 12:24:03 HWC-64 kernel: Code: 48 63 c1 48 03 14 c5 40 a0 67 81 8b 87 b8 
03 00 00 48 89 d5 85 c0 74 37 66 66 90 66 90 b8 ff ff ff ff 48 8b bf c0 03 00 
00 89 c2
Dec 19 12:24:03 HWC-64 kernel: <48> 0f ae 37 48 8b 83 c0 03 00 00 f6 80 00 02 
00 00 01 0f 84 88
Dec 19 12:24:03 HWC-64 kernel: RIP  [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel:  RSP 
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f
Dec 19 12:24:03 HWC-64 kernel: ---[ end trace 366732e1020fb678 ]---

That is 

Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-19 Thread Lennart Sorensen
I am trying to debug a problem that has been happening occationally for
years on some of our systems running 3.0.101 kernel (yes I know it is
old, we are moving to 4.9 at the moment but I would like older releases
to be fixed too, assuming 4.9 makes this problem disappear).

What is happening is that once in a while a process does something wrong
and segfaults, and dumps core.  We have a handler to process the core dump
to name it and compress it and make sure we don't keep to many around,
so the core_pattern uses the pipe option to pipe the dump to a shell
script that saves it with the pid and current timestamp and gzips it.

Once in a while when this happens, the kernel hits a null pointer
dereference in fpu.state->xsave while doing __switch_to.

The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
hyperthreading).  Some people think they have seen it on other systems,
but are not sure.  I have not been able to trigger it on other systems
yet.

It used to take about a week of running tests to trigger it, but I have
now managed to hit it in a few minutes pretty reliably.

The way I trigger it is:

abuse.c:

#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main() {
pid_t pid;

pid = getpid();

printf("%u: pid / PI = %f\n", pid, pid / M_PI);

kill(pid, SIGSEGV);
return 0;
}

I then run:
for i in `seq 1 1 1`; do ./abuse & echo -n; done

That pretty reliably hits the problem.

The crash dump we get is:
Dec 19 12:24:03 HWC-64 kernel: BUG: unable to handle kernel NULL pointer 
dereference at 033f
Dec 19 12:24:03 HWC-64 kernel: IP: [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: PGD 39a6ce067 PUD 39a8cf067 PMD 0
Dec 19 12:24:03 HWC-64 kernel: Oops: 0002 [#1] SMP
Dec 19 12:24:03 HWC-64 kernel: CPU 12
Dec 19 12:24:03 HWC-64 kernel: Modules linked in: dpi_drv(P) ccu_util(P) 
ipv4_mb(P) l2bridge_config_util(P) l2_config_util(P) route_config_util(P) 
qos_config_util(P) sysapp_common(P) chantry_fwd_eng_2800_config(P) 
shim_module(P) sadb_cc(P) ipsecXformer(P) libeCrypto(P) ipmatch_cc(P) l2h_cc(P) 
ndproxy_cc(P) arpint_cc(P) portinfo_cc(P) chantryqos_cc(P) redirector_cc(P) 
ix_ph(P) fpm_core_cc(P) pulse_cc(P) vnstt_cc(P) vnsap_cc(P) fm_cc(P) rutm_cc(P) 
mutm_cc(P) ethernet_tx_cc(P) stkdrv_cc(P) l2bridge_cc(P) events_util(P) 
sched_cc(P) qm_cc(P) ipv4_cc(P) wred_cc(P) tc_meter_cc(P) dscp_classifier_cc(P) 
classifier_6t_cc(P) ent586_cc(P) dev_cc_arp(P) chantry_fwd_eng_2800_tables(P) 
ether_arp_lib(P) rtmv4_lib(P) lkup_lib(P) l2tm_lib(P) fragmentation_lib(P) 
properties_lib(P) msg_support_lib(P) utilities_lib(P) cci_lib(P) rm_lib(P) 
libossl vip productSpec_x86_dp(P) ixgbe igb
Dec 19 12:24:03 HWC-64 kernel:
Dec 19 12:24:03 HWC-64 kernel: Pid: 16440, comm: coremgr.sh Tainted: P  
  3.0.101 #6 Intel Corporation S2600GZ ../S2600GZ
Dec 19 12:24:03 HWC-64 kernel: RIP: 0010:[]  
[] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: RSP: 0018:88042f2c9de8  EFLAGS: 00010002
Dec 19 12:24:03 HWC-64 kernel: RAX:  RBX: 8803d8266ae0 RCX: 
000c
Dec 19 12:24:03 HWC-64 kernel: RDX:  RSI: 88042f29b840 RDI: 

Dec 19 12:24:03 HWC-64 kernel: RBP: 88043f38da40 R08: 8803d8266e08 R09: 
000100018011
Dec 19 12:24:03 HWC-64 kernel: R10: 4000 R11: 0246 R12: 

Dec 19 12:24:03 HWC-64 kernel: R13: 88042ce62080 R14: 000c R15: 
88042f29b840
Dec 19 12:24:03 HWC-64 kernel: FS:  7f72f71a5700() 
GS:88043f38() knlGS:
Dec 19 12:24:03 HWC-64 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f CR3: 0003d8c7 CR4: 
000406e0
Dec 19 12:24:03 HWC-64 kernel: DR0:  DR1:  DR2: 

Dec 19 12:24:03 HWC-64 kernel: DR3:  DR6: 0ff0 DR7: 
0400
Dec 19 12:24:03 HWC-64 kernel: Process coremgr.sh (pid: 16440, threadinfo 
88039a00c000, task 8803d8266ae0)
Dec 19 12:24:03 HWC-64 kernel: Stack:
Dec 19 12:24:03 HWC-64 kernel:  0018 88043f38fe40 
88043f38fe40 88042f29b840
Dec 19 12:24:03 HWC-64 kernel:   81301c94 
88042f29b840 0046
Dec 19 12:24:03 HWC-64 kernel:  88042f2c9fd8 88042f2c9fd8 
a000 
Dec 19 12:24:03 HWC-64 kernel: Call Trace:
Dec 19 12:24:03 HWC-64 kernel: Code: 48 63 c1 48 03 14 c5 40 a0 67 81 8b 87 b8 
03 00 00 48 89 d5 85 c0 74 37 66 66 90 66 90 b8 ff ff ff ff 48 8b bf c0 03 00 
00 89 c2
Dec 19 12:24:03 HWC-64 kernel: <48> 0f ae 37 48 8b 83 c0 03 00 00 f6 80 00 02 
00 00 01 0f 84 88
Dec 19 12:24:03 HWC-64 kernel: RIP  [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel:  RSP 
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f
Dec 19 12:24:03 HWC-64 kernel: ---[ end trace 366732e1020fb678 ]---

That is