NULL pointer dereference in process_one_work

2017-11-23 Thread baiyaowei
Hi tj and jiangshan,

I built a Ceph storage pool to run some benchmarks on a 3.10 kernel.
Occasionally, when the CPU load is very high, some nodes crash with
the message below.

[292273.612014] BUG: unable to handle kernel NULL pointer dereference at
0008
[292273.612057] IP: [] process_one_work+0x31/0x470
[292273.612087] PGD 0 
[292273.612099] Oops:  [#1] SMP 
[292273.612117] Modules linked in: rbd(OE) bcache(OE) ip_vs xfs
xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4
iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_conntrack nf_conntrack ipt_REJECT tun bridge stp llc ebtable_filter
ebtables ip6table_filter ip6_tables iptable_filter bonding
intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul
ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper
cryptd mxm_wmi iTCO_wdt iTCO_vendor_support dcdbas ipmi_devintf pcspkr
ipmi_ssif mei_me sg lpc_ich mei sb_edac ipmi_si mfd_core edac_core
ipmi_msghandler shpchp wmi acpi_power_meter nfsd auth_rpcgss nfs_acl
lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif
crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit
drm_kms_helper
[292273.612495]  crct10dif_pclmul crct10dif_common ttm crc32c_intel drm
ahci nvme bnx2x libahci i2c_core libata mdio libcrc32c megaraid_sas ptp
pps_core dm_mirror dm_region_hash dm_log dm_mod
[292273.612580] CPU: 16 PID: 353223 Comm: kworker/16:2 Tainted: G
OE     3.10.0-327.el7.x86_64 #1
[292273.612620] Hardware name: Dell Inc. PowerEdge R730xd/0WCJNT, BIOS
2.4.3 01/17/2017
[292273.612655] task: 8801f55e6780 ti: 882a199b task.ti:
882a199b
[292273.612685] RIP: 0010:[]  []
process_one_work+0x31/0x470
[292273.612721] RSP: 0018:882a199b3e28  EFLAGS: 00010046
[292273.612743] RAX:  RBX: 88088b273028 RCX:
882a199b3fd8
[292273.612771] RDX:  RSI: 88088b273028 RDI:
88088b273000
[292273.612799] RBP: 882a199b3e60 R08:  R09:
0770
[292273.612827] R10: 8822a3bb1f80 R11: 8822a3bb1f80 R12:
88088b273000
[292273.612855] R13: 881fff313fc0 R14:  R15:
881fff313fc0
[292273.612883] FS:  () GS:881fff30()
knlGS:
[292273.612914] CS:  0010 DS:  ES:  CR0: 80050033
[292273.612937] CR2: 00b8 CR3: 0194a000 CR4:
003407e0
[292273.612965] DR0:  DR1:  DR2:

[292273.612994] DR3:  DR6: fffe0ff0 DR7:
0400
[292273.613021] Stack:
[292273.613031]  ff313fd8  881fff313fd8
000188088b273030
[292273.613069]  8801f55e6780 88088b273000 881fff313fc0
882a199b3ec0
[292273.613108]  8109e4cc 882a199b3fd8 882a199b3fd8
8801f55e6780
[292273.613146] Call Trace:
[292273.613160]  [] worker_thread+0x21c/0x400
[292273.613185]  [] ? rescuer_thread+0x400/0x400
[292273.613212]  [] kthread+0xcf/0xe0
[292273.613234]  [] ?
kthread_create_on_node+0x140/0x140
[292273.613263]  [] ret_from_fork+0x58/0x90
[292273.613287]  [] ?
kthread_create_on_node+0x140/0x140
[292273.614303] Code: 48 89 e5 41 57 41 56 45 31 f6 41 55 41 54 49 89 fc
53 48 89 f3 48 83 ec 10 48 8b 06 4c 8b 6f 48 48 89 c2 30 d2 a8 04 4c 0f
45 f2 <49> 8b 46 08 44 8b b8 00 01 00 00 41 c1 ef 05 44 89 f8 83 e0 01 
[292273.617971] RIP  [] process_one_work+0x31/0x470
[292273.620011]  RSP 
[292273.621940] CR2: 0008
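
A rough reading of the faulting instruction, in case it helps. The bytes
marked <49> 8b 46 08 in the Code: line appear to decode to
mov 0x8(%r14),%rax, and R14 is empty in the register dump, which would
explain a fault at address 0008. The bytes just before it look like an
inlined get_work_pwq(): load work->data, test bit 2, clear the low 8 bits.
The Python sketch below models that decode under those assumptions. The
constant values and the offset of wq inside struct pool_workqueue are
inferred from the disassembly, not checked against the 3.10.0-327 source,
and the function names and sample values are only illustrative.

```python
# Model of how process_one_work() seems to derive the pwq pointer from
# work->data, based only on the Code: bytes in the oops (assumption).

U64 = (1 << 64) - 1

# Assumed values, inferred from "test $0x4,%al" and "xor %dl,%dl".
WORK_STRUCT_PWQ = 1 << 2          # flag: data holds a pool_workqueue pointer
WORK_STRUCT_FLAG_BITS = 8         # low bits reserved for flags
WORK_STRUCT_WQ_DATA_MASK = U64 & ~((1 << WORK_STRUCT_FLAG_BITS) - 1)

# Assumed offset of the wq member, inferred from "mov 0x8(%r14),%rax".
PWQ_WQ_OFFSET = 8


def get_work_pwq(data):
    """Return the pwq address encoded in data, or 0 (NULL) if the flag is clear."""
    if data & WORK_STRUCT_PWQ:
        return data & WORK_STRUCT_WQ_DATA_MASK
    return 0


def first_load_address(data):
    """Address read by the first pwq->wq load for a given work->data value."""
    return get_work_pwq(data) + PWQ_WQ_OFFSET


if __name__ == "__main__":
    # Hypothetical data words, not taken from the vmcore.
    for data in (0x0, 0x1, 0xFFFF880012345605):
        print(hex(data), "->", hex(first_load_address(data)))
```

If this reading is right, work->data did not have the PWQ bit set when the
worker picked the item up, so the pwq pointer came back NULL and the first
load through it faulted at offset 8.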

Some output from the crash utility:

crash> sys
  KERNEL: /usr/lib/debug/lib/modules/3.10.0-327.el7.x86_64/vmlinux
DUMPFILE: vmcore  [PARTIAL DUMP]
CPUS: 32
DATE: Wed Oct 18 05:21:14 2017
  UPTIME: 3 days, 09:07:25
LOAD AVERAGE: 221.70, 222.22, 224.96
   TASKS: 3115
NODENAME: node121
 RELEASE: 3.10.0-327.el7.x86_64
 VERSION: #1 SMP Thu Nov 19 22:10:57 UTC 2015
 MACHINE: x86_64  (2099 Mhz)
  MEMORY: 255.9 GB
   PANIC: "BUG: unable to handle kernel NULL pointer dereference at
0008"
crash> bt
PID: 353223  TASK: 8801f55e6780  CPU: 16  COMMAND: "kworker/16:2"
 #0 [882a199b3af0] machine_kexec at 81051beb
 #1 [882a199b3b50] crash_kexec at 810f2542
 #2 [882a199b3c20] oops_end at 8163e1a8
 #3 [882a199b3c48] no_context at 8162e2b8
 #4 [882a199b3c98] __bad_area_nosemaphore at 8162e34e
 #5 [882a199b3ce0] bad_area_nosemaphore at 8162e4b8
 #6 [882a199b3cf0] __do_page_fault at 81640fce
 #7 [882a199b3d48] do_page_fault at 81641113
 #8 [882a199b3d70] page_fault at 8163d408
[exception RIP: process_one_work+49]
RIP: 8109d4b1  RSP: 882a199b3e28  RFLAGS: 00010046
RAX:   RBX: 88088b273028  RCX: 882a199b3fd8
RDX:   RSI: 88088b273028  RDI: 88088b273000
RBP: 882a199b3e60   R8:    R9: 0770
