[Kernel-packages] [Bug 1315736] Re: [Dell PowerEdge R720] Machine Check Exception
I seem to be hitting this bug as well. It is on a production server, but I have some flexibility in rebooting it. A few observations:

1. The kernel bug only happens with Java (tested with both open-jdk7 and oracle8).
2. The java processes block and cannot be killed.
3. Any process that tries to inspect a blocked java process becomes blocked as well (e.g. top, ps, ...). An strace of ps:

   open(/proc/41126/status, O_RDONLY)= 6
   read(6, Name:\tjava\nState:\tD (disk sleep)..., 1024) = 870
   close(6)= 0
   open(/proc/41126/cmdline, O_RDONLY) = 6
   read(6, [BLOCKS there]

4. As long as nothing queries the blocked java processes, everything else works (though the load of the machine is apparently high).

Tell me what you need done to test this, and I will do it.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1315736

Title:
  [Dell PowerEdge R720] Machine Check Exception

Status in “linux” package in Ubuntu:
  Incomplete

Bug description:
  Dell PowerEdge 720 on ubuntu 14.04 shows MCE errors on dmesg. Dell support instructed to run DSET and BIOS hardware diagnostics. Neither of the tools showed any errors. Dell support said that if there was a hardware error it would have been shown on Dell logs and the probable reason for the dmesg log is a bug in ubuntu kernel MCE reporting. So, is it that following dmesg is because of a kernel bug in ubuntu 14.04 server?

  [11562.171040] Please check user daemon is running.
  [94953.306404] sbridge: HANDLING MCE MEMORY ERROR
  [94953.306415] CPU 1: Machine Check Exception: 0 Bank 9: 8c4b000800c0
  [94953.306416] TSC 0 ADDR 2dfa0e1000 MISC 9800080168c PROCESSOR 0:306e4 TIME 1399142359 SOCKET 1 APIC 20
  [94953.306422] sbridge: HANDLING MCE MEMORY ERROR
  [94953.306423] CPU 1: Machine Check Exception: 0 Bank 10: 8c5800c1
  [94953.306424] TSC 0 ADDR 2dfa0e1000 MISC 900208c PROCESSOR 0:306e4 TIME 1399142359 SOCKET 1 APIC 20
  [94953.532217] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#0_DIMM#0 (channel:0 slot:0 page:0x2dfa0e1 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c0 socket:1 channel_mask:3 rank:0)
  [94953.532226] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#1_DIMM#0 (channel:1 slot:0 page:0x2dfa0e1 offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0008:00c1 socket:1 channel_mask:3 rank:0)
  ---
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116, 1 touko 2 19:15 seq
   crw-rw 1 root audio 116, 33 touko 2 19:15 timer
  AplayDevices: Error: [Errno 2] No such file or directory
  ApportVersion: 2.14.1-0ubuntu3
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: Error: [Errno 2] No such file or directory
  CurrentDmesg:
   Error: command ['sh', '-c', 'dmesg | comm -13 --nocheck-order /var/log/dmesg -'] failed with exit code 1: comm: /var/log/dmesg: Permission denied
   dmesg: write failed: Broken pipe
  DistroRelease: Ubuntu 14.04
  InstallationDate: Installed on 2014-02-26 (66 days ago)
  InstallationMedia: Ubuntu-Server 14.04 LTS Trusty Tahr - Alpha amd64 (20140219)
  MachineType: Dell Inc. PowerEdge R720
  Package: linux (not installed)
  PciMultimedia:
  ProcFB: 0 VESA VGA
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-24-generic root=UUID=c03eb237-955a-4dee-bba1-deded53df372 ro
  ProcVersionSignature: Ubuntu 3.13.0-24.46-generic 3.13.9
  RfKill: Error: [Errno 2] No such file or directory
  Tags: trusty
  Uname: Linux 3.13.0-24-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:
  WifiSyslog:
  _MarkForUpload: True
  dmi.bios.date: 01/16/2014
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.2.2
  dmi.board.name: 0DCWD1
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.2.2:bd01/16/2014:svnDellInc.:pnPowerEdgeR720:pvr:rvnDellInc.:rn0DCWD1:rvrA01:cvnDellInc.:ct23:cvr:
  dmi.product.name: PowerEdge R720
  dmi.sys.vendor: Dell Inc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1315736/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help : https://help.launchpad.net/ListHelp
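
Editor's sketch, not part of the original message: point 3 of the comment above reports that reading /proc/41126/status succeeded while reading /proc/41126/cmdline blocked. Under that assumption, stuck processes can be listed without using ps or top by reading only the status files. The function names below are hypothetical, and nothing in the thread guarantees that status stays readable on every affected machine.

```python
import os


def parse_state(status_text):
    """Return (name, state_letter) from the text of /proc/<pid>/status."""
    name, state = None, None
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("State:"):
            # "State:\tD (disk sleep)" -> "D"
            state = line.split(":", 1)[1].strip().split()[0]
    return name, state


def uninterruptible_processes(proc_root="/proc"):
    """Yield (pid, name) for processes in state D.

    Only the 'status' file is opened; 'cmdline' is deliberately never
    touched, because that is the read that blocked in the strace.
    """
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                name, state = parse_state(fh.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if state == "D":
            yield int(entry), name
```

Calling `list(uninterruptible_processes())` on the affected host would then give the PIDs and names of processes in uninterruptible sleep, which may help confirm point 2 without hanging the inspecting shell.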
[Kernel-packages] [Bug 1315736] Re: [Dell PowerEdge R720] Machine Check Exception
I will do this, but one important comment: I am on a Supermicro, not a Dell. But the bug seems the same (same bug kernel line, and also java-related taints).
[Kernel-packages] [Bug 1315736] Re: [Dell PowerEdge R720] Machine Check Exception
We have now installed the new kernel, but as the bug is non-deterministic, we will have to wait until it manifests itself.
[Kernel-packages] [Bug 1315736] Re: [Dell PowerEdge R720] Machine Check Exception
Sami,

Good observation: I do not have a machine check exception. The similarities are: a bug reported on the same line, similar behaviour, and Java being involved. For reference, I copy my kernel bug below (I get several instances of it; the only difference is that the later ones are tainted). As soon as I hit a problem with the new upstream kernel I will report back.

May 9 09:55:29 wintermute kernel: [604868.582044] [ cut here ]
May 9 09:55:29 wintermute kernel: [604868.582059] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
May 9 09:55:29 wintermute kernel: [604868.582064] invalid opcode: [#1] SMP
May 9 09:55:29 wintermute kernel: [604868.582069] Modules linked in: veth xt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables bridge stp llc bnep rfcomm bluetooth aufs binfmt_misc kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd parport_pc ppdev psmouse amd64_edac_mod sp5100_tco serio_raw edac_core lp fam15h_power k10temp i2c_piix4 edac_mce_amd mac_hid parport hid_generic usbhid hid usb_storage ixgbe igb mdio i2c_algo_bit dca ahci ptp libahci pps_core
May 9 09:55:29 wintermute kernel: [604868.582148] CPU: 21 PID: 25260 Comm: java Not tainted 3.13.0-24-generic #46-Ubuntu
May 9 09:55:29 wintermute kernel: [604868.582152] Hardware name: Supermicro H8QG6/H8QG6, BIOS 3.512/16/2013
May 9 09:55:29 wintermute kernel: [604868.582156] task: 8876d3985fc0 ti: 8871f58c8000 task.ti: 8871f58c8000
May 9 09:55:29 wintermute kernel: [604868.582159] RIP: 0010:[81179051] [81179051] handle_mm_fault+0xe61/0xf10
May 9 09:55:29 wintermute kernel: [604868.582171] RSP: :8871f58c9d98 EFLAGS: 00010246
May 9 09:55:29 wintermute kernel: [604868.582174] RAX: 0100 RBX: 7fa583801ea0 RCX: 8871f58c9b18
May 9 09:55:29 wintermute kernel: [604868.582177] RDX: 8876d3985fc0 RSI: RDI: 8020286009e6
May 9 09:55:29 wintermute kernel: [604868.582180] RBP: 8871f58c9e20 R08: R09: 00a9
May 9 09:55:29 wintermute kernel: [604868.582182] R10: 0001 R11: R12: 883fb68b30e0
May 9 09:55:29 wintermute kernel: [604868.582185] R13: 882e351b2600 R14: 88702aceec80 R15: 0080
May 9 09:55:29 wintermute kernel: [604868.582188] FS: 7fa5603f2700() GS:882fe7d4() knlGS:
May 9 09:55:29 wintermute kernel: [604868.582192] CS: 0010 DS: ES: CR0: 8005003b
May 9 09:55:29 wintermute kernel: [604868.582194] CR2: 7fa583a05620 CR3: 007861d59000 CR4: 000407e0
May 9 09:55:29 wintermute kernel: [604868.582198] Stack:
May 9 09:55:29 wintermute kernel: [604868.582200] 8871f58c9e20 88702aceec80 7fad7d38fd70 7fa583804020
May 9 09:55:29 wintermute kernel: [604868.582241] 2190 7fad7401bb68 0002
May 9 09:55:29 wintermute kernel: [604868.582266] 887101ef5e20 7fad781a900f 88a9 ff03
May 9 09:55:29 wintermute kernel: [604868.582283] Call Trace:
May 9 09:55:29 wintermute kernel: [604868.582297] [817219a4] __do_page_fault+0x184/0x560
May 9 09:55:29 wintermute kernel: [604868.582311] [82fc] ? acct_account_cputime+0x1c/0x20
May 9 09:55:29 wintermute kernel: [604868.582321] [8109d76b] ? account_user_time+0x8b/0xa0
May 9 09:55:29 wintermute kernel: [604868.582329] [8109dd84] ? vtime_account_user+0x54/0x60
May 9 09:55:29 wintermute kernel: [604868.582338] [81721d9a] do_page_fault+0x1a/0x70
May 9 09:55:29 wintermute kernel: [604868.582349] [8171e208] page_fault+0x28/0x30
May 9 09:55:29 wintermute kernel: [604868.582353] Code: ff 48 89 d9 4c 89 e2 4c 89 ee 4c 89 f7 44 89 4d c8 e8 34 c1 ff ff 85 c0 0f 85 94 f5 ff ff 49 8b 3c 24 44 8b 4d c8 e9 68 f3 ff ff 0f 0b be 8e 00 00 00 48 c7 c7 18 25 a6 81 44 89 4d c8 e8 18 e7
May 9 09:55:29 wintermute kernel: [604868.582415] RIP [81179051] handle_mm_fault+0xe61/0xf10
May 9 09:55:29 wintermute kernel: [604868.582421] RSP 8871f58c9d98
May 9 09:55:29 wintermute kernel: [604868.582426] ---[ end trace 77f5d1b963750a41 ]---
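
Editor's sketch, not part of the original message: the comparison described above ("a bug reported on the same line", Java involved, tainted or not) can be made mechanical by pulling the BUG site, the command name, and the taint state out of each oops. The regular expressions and the function name `bug_signature` are assumptions based only on the two log lines quoted in this message; command names containing spaces are not handled.

```python
import re

# Matches e.g. "kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!"
BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")

# Matches e.g. "CPU: 21 PID: 25260 Comm: java Not tainted 3.13.0-24-generic"
CPU_RE = re.compile(
    r"CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) Comm: (?P<comm>\S+) "
    r"(?P<taint>Not tainted|Tainted: \S+)"
)


def bug_signature(log_text):
    """Summarise one oops so that two reports can be compared field by field."""
    bug = BUG_RE.search(log_text)
    cpu = CPU_RE.search(log_text)
    return {
        # (source file, line number) of the BUG, or None if absent
        "site": (bug.group("path"), int(bug.group("line"))) if bug else None,
        # name of the faulting command, e.g. "java"
        "comm": cpu.group("comm") if cpu else None,
        # False for "Not tainted", True otherwise, None if the line is missing
        "tainted": (cpu.group("taint") != "Not tainted") if cpu else None,
    }
```

Two reports would count as the same BUG site when their `site` values are equal; `comm` and `tainted` then show whether Java and taint flags line up as well.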