subject:"Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101"

Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-20 Thread Lennart Sorensen

On Mon, Dec 19, 2016 at 01:09:39PM -0500, Lennart Sorensen wrote:
> I am trying to debug a problem that has been happening occationally for
> years on some of our systems running 3.0.101 kernel (yes I know it is
> old, we are moving to 4.9 at the moment but I would like older releases
> to be fixed too, assuming 4.9 makes this problem disappear).
> 
> What is happening is that once in a while a process does something wrong
> and segfaults, and dumps core.  We have a handler to process the core dump
> to name it and compress it and make sure we don't keep to many around,
> so the core_pattern uses the pipe option to pipe the dump to a shell
> script that saves it with the pid and current timestamp and gzips it.
> 
> Once in a while when this happens, the kernel hits a null pointer
> dereference in fpu.state->xsave while doing __switch_to.
> 
> The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
> hyperthreading).  Some people think they have seen it on other systems,
> but are not sure.  I have not been able to trigger it on other systems
> yet.
> 
> It used to take about a week of running tests to trigger it, but I have
> now managed to hit it in a few minutes pretty reliably.

If the core_pattern is not set to use a pipe, but just save as core.%e.%p
then the problem does not happen.

-- 
Len Sorensen

Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-20 Thread Lennart Sorensen

On Mon, Dec 19, 2016 at 01:09:39PM -0500, Lennart Sorensen wrote:
> I am trying to debug a problem that has been happening occationally for
> years on some of our systems running 3.0.101 kernel (yes I know it is
> old, we are moving to 4.9 at the moment but I would like older releases
> to be fixed too, assuming 4.9 makes this problem disappear).
> 
> What is happening is that once in a while a process does something wrong
> and segfaults, and dumps core.  We have a handler to process the core dump
> to name it and compress it and make sure we don't keep to many around,
> so the core_pattern uses the pipe option to pipe the dump to a shell
> script that saves it with the pid and current timestamp and gzips it.
> 
> Once in a while when this happens, the kernel hits a null pointer
> dereference in fpu.state->xsave while doing __switch_to.
> 
> The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
> hyperthreading).  Some people think they have seen it on other systems,
> but are not sure.  I have not been able to trigger it on other systems
> yet.
> 
> It used to take about a week of running tests to trigger it, but I have
> now managed to hit it in a few minutes pretty reliably.

If the core_pattern is not set to use a pipe, but just save as core.%e.%p
then the problem does not happen.

-- 
Len Sorensen

Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-19 Thread Lennart Sorensen

I am trying to debug a problem that has been happening occationally for
years on some of our systems running 3.0.101 kernel (yes I know it is
old, we are moving to 4.9 at the moment but I would like older releases
to be fixed too, assuming 4.9 makes this problem disappear).

What is happening is that once in a while a process does something wrong
and segfaults, and dumps core.  We have a handler to process the core dump
to name it and compress it and make sure we don't keep to many around,
so the core_pattern uses the pipe option to pipe the dump to a shell
script that saves it with the pid and current timestamp and gzips it.

Once in a while when this happens, the kernel hits a null pointer
dereference in fpu.state->xsave while doing __switch_to.

The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
hyperthreading).  Some people think they have seen it on other systems,
but are not sure.  I have not been able to trigger it on other systems
yet.

It used to take about a week of running tests to trigger it, but I have
now managed to hit it in a few minutes pretty reliably.

The way I trigger it is:

abuse.c:

#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main() {
pid_t pid;

pid = getpid();

printf("%u: pid / PI = %f\n", pid, pid / M_PI);

kill(pid, SIGSEGV);
return 0;
}

I then run:
for i in `seq 1 1 1`; do ./abuse & echo -n; done

That pretty reliably hits the problem.

The crash dump we get is:
Dec 19 12:24:03 HWC-64 kernel: BUG: unable to handle kernel NULL pointer 
dereference at 033f
Dec 19 12:24:03 HWC-64 kernel: IP: [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: PGD 39a6ce067 PUD 39a8cf067 PMD 0
Dec 19 12:24:03 HWC-64 kernel: Oops: 0002 [#1] SMP
Dec 19 12:24:03 HWC-64 kernel: CPU 12
Dec 19 12:24:03 HWC-64 kernel: Modules linked in: dpi_drv(P) ccu_util(P) 
ipv4_mb(P) l2bridge_config_util(P) l2_config_util(P) route_config_util(P) 
qos_config_util(P) sysapp_common(P) chantry_fwd_eng_2800_config(P) 
shim_module(P) sadb_cc(P) ipsecXformer(P) libeCrypto(P) ipmatch_cc(P) l2h_cc(P) 
ndproxy_cc(P) arpint_cc(P) portinfo_cc(P) chantryqos_cc(P) redirector_cc(P) 
ix_ph(P) fpm_core_cc(P) pulse_cc(P) vnstt_cc(P) vnsap_cc(P) fm_cc(P) rutm_cc(P) 
mutm_cc(P) ethernet_tx_cc(P) stkdrv_cc(P) l2bridge_cc(P) events_util(P) 
sched_cc(P) qm_cc(P) ipv4_cc(P) wred_cc(P) tc_meter_cc(P) dscp_classifier_cc(P) 
classifier_6t_cc(P) ent586_cc(P) dev_cc_arp(P) chantry_fwd_eng_2800_tables(P) 
ether_arp_lib(P) rtmv4_lib(P) lkup_lib(P) l2tm_lib(P) fragmentation_lib(P) 
properties_lib(P) msg_support_lib(P) utilities_lib(P) cci_lib(P) rm_lib(P) 
libossl vip productSpec_x86_dp(P) ixgbe igb
Dec 19 12:24:03 HWC-64 kernel:
Dec 19 12:24:03 HWC-64 kernel: Pid: 16440, comm: coremgr.sh Tainted: P  
  3.0.101 #6 Intel Corporation S2600GZ ../S2600GZ
Dec 19 12:24:03 HWC-64 kernel: RIP: 0010:[]  
[] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: RSP: 0018:88042f2c9de8  EFLAGS: 00010002
Dec 19 12:24:03 HWC-64 kernel: RAX:  RBX: 8803d8266ae0 RCX: 
000c
Dec 19 12:24:03 HWC-64 kernel: RDX:  RSI: 88042f29b840 RDI: 

Dec 19 12:24:03 HWC-64 kernel: RBP: 88043f38da40 R08: 8803d8266e08 R09: 
000100018011
Dec 19 12:24:03 HWC-64 kernel: R10: 4000 R11: 0246 R12: 

Dec 19 12:24:03 HWC-64 kernel: R13: 88042ce62080 R14: 000c R15: 
88042f29b840
Dec 19 12:24:03 HWC-64 kernel: FS:  7f72f71a5700() 
GS:88043f38() knlGS:
Dec 19 12:24:03 HWC-64 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f CR3: 0003d8c7 CR4: 
000406e0
Dec 19 12:24:03 HWC-64 kernel: DR0:  DR1:  DR2: 

Dec 19 12:24:03 HWC-64 kernel: DR3:  DR6: 0ff0 DR7: 
0400
Dec 19 12:24:03 HWC-64 kernel: Process coremgr.sh (pid: 16440, threadinfo 
88039a00c000, task 8803d8266ae0)
Dec 19 12:24:03 HWC-64 kernel: Stack:
Dec 19 12:24:03 HWC-64 kernel:  0018 88043f38fe40 
88043f38fe40 88042f29b840
Dec 19 12:24:03 HWC-64 kernel:   81301c94 
88042f29b840 0046
Dec 19 12:24:03 HWC-64 kernel:  88042f2c9fd8 88042f2c9fd8 
a000 
Dec 19 12:24:03 HWC-64 kernel: Call Trace:
Dec 19 12:24:03 HWC-64 kernel: Code: 48 63 c1 48 03 14 c5 40 a0 67 81 8b 87 b8 
03 00 00 48 89 d5 85 c0 74 37 66 66 90 66 90 b8 ff ff ff ff 48 8b bf c0 03 00 
00 89 c2
Dec 19 12:24:03 HWC-64 kernel: <48> 0f ae 37 48 8b 83 c0 03 00 00 f6 80 00 02 
00 00 01 0f 84 88
Dec 19 12:24:03 HWC-64 kernel: RIP  [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel:  RSP 
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f
Dec 19 12:24:03 HWC-64 kernel: ---[ end trace 366732e1020fb678 ]---

That is

Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

2016-12-19 Thread Lennart Sorensen

I am trying to debug a problem that has been happening occationally for
years on some of our systems running 3.0.101 kernel (yes I know it is
old, we are moving to 4.9 at the moment but I would like older releases
to be fixed too, assuming 4.9 makes this problem disappear).

What is happening is that once in a while a process does something wrong
and segfaults, and dumps core.  We have a handler to process the core dump
to name it and compress it and make sure we don't keep to many around,
so the core_pattern uses the pipe option to pipe the dump to a shell
script that saves it with the pid and current timestamp and gzips it.

Once in a while when this happens, the kernel hits a null pointer
dereference in fpu.state->xsave while doing __switch_to.

The system ix x86_64 with dual E5-2620 CPUs (6 cores each with
hyperthreading).  Some people think they have seen it on other systems,
but are not sure.  I have not been able to trigger it on other systems
yet.

It used to take about a week of running tests to trigger it, but I have
now managed to hit it in a few minutes pretty reliably.

The way I trigger it is:

abuse.c:

#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main() {
pid_t pid;

pid = getpid();

printf("%u: pid / PI = %f\n", pid, pid / M_PI);

kill(pid, SIGSEGV);
return 0;
}

I then run:
for i in `seq 1 1 1`; do ./abuse & echo -n; done

That pretty reliably hits the problem.

The crash dump we get is:
Dec 19 12:24:03 HWC-64 kernel: BUG: unable to handle kernel NULL pointer 
dereference at 033f
Dec 19 12:24:03 HWC-64 kernel: IP: [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: PGD 39a6ce067 PUD 39a8cf067 PMD 0
Dec 19 12:24:03 HWC-64 kernel: Oops: 0002 [#1] SMP
Dec 19 12:24:03 HWC-64 kernel: CPU 12
Dec 19 12:24:03 HWC-64 kernel: Modules linked in: dpi_drv(P) ccu_util(P) 
ipv4_mb(P) l2bridge_config_util(P) l2_config_util(P) route_config_util(P) 
qos_config_util(P) sysapp_common(P) chantry_fwd_eng_2800_config(P) 
shim_module(P) sadb_cc(P) ipsecXformer(P) libeCrypto(P) ipmatch_cc(P) l2h_cc(P) 
ndproxy_cc(P) arpint_cc(P) portinfo_cc(P) chantryqos_cc(P) redirector_cc(P) 
ix_ph(P) fpm_core_cc(P) pulse_cc(P) vnstt_cc(P) vnsap_cc(P) fm_cc(P) rutm_cc(P) 
mutm_cc(P) ethernet_tx_cc(P) stkdrv_cc(P) l2bridge_cc(P) events_util(P) 
sched_cc(P) qm_cc(P) ipv4_cc(P) wred_cc(P) tc_meter_cc(P) dscp_classifier_cc(P) 
classifier_6t_cc(P) ent586_cc(P) dev_cc_arp(P) chantry_fwd_eng_2800_tables(P) 
ether_arp_lib(P) rtmv4_lib(P) lkup_lib(P) l2tm_lib(P) fragmentation_lib(P) 
properties_lib(P) msg_support_lib(P) utilities_lib(P) cci_lib(P) rm_lib(P) 
libossl vip productSpec_x86_dp(P) ixgbe igb
Dec 19 12:24:03 HWC-64 kernel:
Dec 19 12:24:03 HWC-64 kernel: Pid: 16440, comm: coremgr.sh Tainted: P  
  3.0.101 #6 Intel Corporation S2600GZ ../S2600GZ
Dec 19 12:24:03 HWC-64 kernel: RIP: 0010:[]  
[] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel: RSP: 0018:88042f2c9de8  EFLAGS: 00010002
Dec 19 12:24:03 HWC-64 kernel: RAX:  RBX: 8803d8266ae0 RCX: 
000c
Dec 19 12:24:03 HWC-64 kernel: RDX:  RSI: 88042f29b840 RDI: 

Dec 19 12:24:03 HWC-64 kernel: RBP: 88043f38da40 R08: 8803d8266e08 R09: 
000100018011
Dec 19 12:24:03 HWC-64 kernel: R10: 4000 R11: 0246 R12: 

Dec 19 12:24:03 HWC-64 kernel: R13: 88042ce62080 R14: 000c R15: 
88042f29b840
Dec 19 12:24:03 HWC-64 kernel: FS:  7f72f71a5700() 
GS:88043f38() knlGS:
Dec 19 12:24:03 HWC-64 kernel: CS:  0010 DS:  ES:  CR0: 80050033
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f CR3: 0003d8c7 CR4: 
000406e0
Dec 19 12:24:03 HWC-64 kernel: DR0:  DR1:  DR2: 

Dec 19 12:24:03 HWC-64 kernel: DR3:  DR6: 0ff0 DR7: 
0400
Dec 19 12:24:03 HWC-64 kernel: Process coremgr.sh (pid: 16440, threadinfo 
88039a00c000, task 8803d8266ae0)
Dec 19 12:24:03 HWC-64 kernel: Stack:
Dec 19 12:24:03 HWC-64 kernel:  0018 88043f38fe40 
88043f38fe40 88042f29b840
Dec 19 12:24:03 HWC-64 kernel:   81301c94 
88042f29b840 0046
Dec 19 12:24:03 HWC-64 kernel:  88042f2c9fd8 88042f2c9fd8 
a000 
Dec 19 12:24:03 HWC-64 kernel: Call Trace:
Dec 19 12:24:03 HWC-64 kernel: Code: 48 63 c1 48 03 14 c5 40 a0 67 81 8b 87 b8 
03 00 00 48 89 d5 85 c0 74 37 66 66 90 66 90 b8 ff ff ff ff 48 8b bf c0 03 00 
00 89 c2
Dec 19 12:24:03 HWC-64 kernel: <48> 0f ae 37 48 8b 83 c0 03 00 00 f6 80 00 02 
00 00 01 0f 84 88
Dec 19 12:24:03 HWC-64 kernel: RIP  [] __switch_to+0x4c/0x2b0
Dec 19 12:24:03 HWC-64 kernel:  RSP 
Dec 19 12:24:03 HWC-64 kernel: CR2: 033f
Dec 19 12:24:03 HWC-64 kernel: ---[ end trace 366732e1020fb678 ]---

That is

Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

Re: Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

Debug hints for fpu state NULL pointer dereference on context switch during core dump in 3.0.101

4 matches

Site Navigation

Mail list logo

Footer information