[Kernel-packages] [Bug 1315736] Re: [Dell PowerEdge R720] Machine Check Exception

2014-05-12 Thread Tiago Antao
I seem to be hitting this bug as well. Although this is a production
server, I have some flexibility to reboot it.

I can note a few issues:

1. The kernel bug only happens with Java (tested with both OpenJDK 7 and
Oracle Java 8).

2. The Java processes block and cannot be killed.

3. Any process that tries to inspect the Java process becomes blocked (e.g.
top, ps, ...). An strace of a ps:
open("/proc/41126/status", O_RDONLY) = 6
read(6, "Name:\tjava\nState:\tD (disk sleep)"..., 1024) = 870
close(6) = 0
open("/proc/41126/cmdline", O_RDONLY) = 6
read(6, 
[BLOCKS there]

4. As long as nothing queries the blocked Java processes, everything
works (though the load on the machine is apparently high).
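
Points 2 to 4 suggest a way to list the stuck processes without getting stuck too: in the strace above, the read of /proc/<pid>/status returned normally and only the read of /proc/<pid>/cmdline blocked. What follows is a minimal sketch, assuming that /proc/<pid>/stat is as safe to read as status was here (an assumption, not something verified on the affected machine). The function names are made up for illustration.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' rather than on whitespace.
    """
    left, right = text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def blocked_processes(proc_root="/proc"):
    """List (pid, comm) for processes in uninterruptible sleep (state D).

    Only <pid>/stat is read. <pid>/cmdline is deliberately never opened,
    because that is the read that hung in the strace.
    """
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited, or the line could not be parsed
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling blocked_processes() on a Linux machine returns a sorted list of (pid, comm) pairs, which should include the hung java processes if the assumption about stat holds.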

Tell me what needs to be done to test this, and I will do it.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1315736

Title:
  [Dell PowerEdge R720] Machine Check Exception

Status in “linux” package in Ubuntu:
  Incomplete

Bug description:
  A Dell PowerEdge R720 running Ubuntu 14.04 shows MCE errors in dmesg. Dell
  support asked for DSET and the BIOS hardware diagnostics to be run. Neither
  tool showed any errors. Dell support said that a hardware error would have
  shown up in the Dell logs, and that the probable reason for the dmesg
  entries is a bug in the Ubuntu kernel's MCE reporting.

  So, is the following dmesg output caused by a kernel bug in Ubuntu
  14.04 server?

  [11562.171040] Please check user daemon is running.
  [94953.306404] sbridge: HANDLING MCE MEMORY ERROR
  [94953.306415] CPU 1: Machine Check Exception: 0 Bank 9: 8c4b000800c0
  [94953.306416] TSC 0 ADDR 2dfa0e1000 MISC 9800080168c PROCESSOR 0:306e4 TIME 1399142359 SOCKET 1 APIC 20
  [94953.306422] sbridge: HANDLING MCE MEMORY ERROR
  [94953.306423] CPU 1: Machine Check Exception: 0 Bank 10: 8c5800c1
  [94953.306424] TSC 0 ADDR 2dfa0e1000 MISC 900208c PROCESSOR 0:306e4 TIME 1399142359 SOCKET 1 APIC 20
  [94953.532217] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#0_DIMM#0 (channel:0 slot:0 page:0x2dfa0e1 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c0 socket:1 channel_mask:3 rank:0)
  [94953.532226] EDAC MC1: 1 CE memory scrubbing error on CPU_SrcID#1_Channel#1_DIMM#0 (channel:1 slot:0 page:0x2dfa0e1 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:1 channel_mask:3 rank:0)
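
  The MCE records and the EDAC lines in the log above can be cross-checked against each other. This is a minimal sketch, assuming 4 KiB pages (the usual x86_64 page size) and assuming the TIME field is seconds since the Unix epoch. The helper names are illustrative only, and the regular expression is fitted to the two EDAC lines shown here, not to every EDAC message format.

```python
import re
from datetime import datetime, timezone

PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Fitted to the "CE ... error on <label> (channel:.. slot:.. page:.. offset:..)"
# lines quoted in this report.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*? on (?P<label>\S+) "
    r"\(channel:(?P<channel>\d+) slot:(?P<slot>\d+) "
    r"page:(?P<page>0x[0-9a-f]+) offset:(?P<offset>0x[0-9a-f]+)"
)


def parse_edac(line):
    """Extract the DIMM label and physical address from one EDAC CE line.

    Whitespace is normalised first, so a line that was wrapped by a mail
    client still parses. Returns None for lines that are not EDAC CE lines.
    """
    match = EDAC_RE.search(" ".join(line.split()))
    if match is None:
        return None
    page = int(match["page"], 16)
    offset = int(match["offset"], 16)
    return {
        "label": match["label"],
        "channel": int(match["channel"]),
        "slot": int(match["slot"]),
        "address": (page << PAGE_SHIFT) + offset,
    }


def mce_time(seconds):
    """Convert the TIME field of an MCE record to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

  For the EDAC lines above, page 0x2dfa0e1 shifted left by 12 bits gives 0x2dfa0e1000, the same value as the ADDR field of both MCE records, so the two EDAC lines and the two MCE records describe the same physical address.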

  ---
  AlsaDevices:
   total 0
   crw-rw 1 root audio 116,  1 touko  2 19:15 seq
   crw-rw 1 root audio 116, 33 touko  2 19:15 timer
  AplayDevices: Error: [Errno 2] No such file or directory
  ApportVersion: 2.14.1-0ubuntu3
  Architecture: amd64
  ArecordDevices: Error: [Errno 2] No such file or directory
  AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
  CRDA: Error: [Errno 2] No such file or directory
  CurrentDmesg:
   Error: command ['sh', '-c', 'dmesg | comm -13 --nocheck-order /var/log/dmesg -'] failed with exit code 1: comm: /var/log/dmesg: Permission denied
   dmesg: write failed: Broken pipe
  DistroRelease: Ubuntu 14.04
  InstallationDate: Installed on 2014-02-26 (66 days ago)
  InstallationMedia: Ubuntu-Server 14.04 LTS Trusty Tahr - Alpha amd64 (20140219)
  MachineType: Dell Inc. PowerEdge R720
  Package: linux (not installed)
  PciMultimedia:

  ProcFB: 0 VESA VGA
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-24-generic root=UUID=c03eb237-955a-4dee-bba1-deded53df372 ro
  ProcVersionSignature: Ubuntu 3.13.0-24.46-generic 3.13.9
  RfKill: Error: [Errno 2] No such file or directory
  Tags:  trusty
  Uname: Linux 3.13.0-24-generic x86_64
  UpgradeStatus: No upgrade log present (probably fresh install)
  UserGroups:

  WifiSyslog:

  _MarkForUpload: True
  dmi.bios.date: 01/16/2014
  dmi.bios.vendor: Dell Inc.
  dmi.bios.version: 2.2.2
  dmi.board.name: 0DCWD1
  dmi.board.vendor: Dell Inc.
  dmi.board.version: A01
  dmi.chassis.type: 23
  dmi.chassis.vendor: Dell Inc.
  dmi.modalias: dmi:bvnDellInc.:bvr2.2.2:bd01/16/2014:svnDellInc.:pnPowerEdgeR720:pvr:rvnDellInc.:rn0DCWD1:rvrA01:cvnDellInc.:ct23:cvr:
  dmi.product.name: PowerEdge R720
  dmi.sys.vendor: Dell Inc.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1315736/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp


[Kernel-packages] [Bug 1315736] Re: [Dell PowerEdge R720] Machine Check Exception

2014-05-13 Thread Tiago Antao
I will do this, but one important note: I am on a Supermicro, not a
Dell. The bug seems to be the same, though (same kernel BUG line, and
also Java-related taints).



[Kernel-packages] [Bug 1315736] Re: [Dell PowerEdge R720] Machine Check Exception

2014-05-13 Thread Tiago Antao
We have now installed the new kernel, but as the bug is
non-deterministic, we will have to wait until it manifests itself.



[Kernel-packages] [Bug 1315736] Re: [Dell PowerEdge R720] Machine Check Exception

2014-05-13 Thread Tiago Antao
Sami,

Good observation: I do not have a machine check exception. The
similarities are: a kernel BUG reported at the same line, similar
behaviour, and Java being involved. For reference, I copy my kernel BUG
report below (I get several instances of it; only the later ones are
marked as tainted). As soon as I have a problem with the new upstream
kernel, I will report back.

May  9 09:55:29 wintermute kernel: [604868.582044] ------------[ cut here ]------------
May  9 09:55:29 wintermute kernel: [604868.582059] kernel BUG at /build/buildd/linux-3.13.0/mm/memory.c:3756!
May  9 09:55:29 wintermute kernel: [604868.582064] invalid opcode:  [#1] SMP
May  9 09:55:29 wintermute kernel: [604868.582069] Modules linked in: veth xt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables bridge stp llc bnep rfcomm bluetooth aufs binfmt_misc kvm_amd kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd parport_pc ppdev psmouse amd64_edac_mod sp5100_tco serio_raw edac_core lp fam15h_power k10temp i2c_piix4 edac_mce_amd mac_hid parport hid_generic usbhid hid usb_storage ixgbe igb mdio i2c_algo_bit dca ahci ptp libahci pps_core
May  9 09:55:29 wintermute kernel: [604868.582148] CPU: 21 PID: 25260 Comm: java Not tainted 3.13.0-24-generic #46-Ubuntu
May  9 09:55:29 wintermute kernel: [604868.582152] Hardware name: Supermicro H8QG6/H8QG6, BIOS 3.5 12/16/2013
May  9 09:55:29 wintermute kernel: [604868.582156] task: 8876d3985fc0 ti: 8871f58c8000 task.ti: 8871f58c8000
May  9 09:55:29 wintermute kernel: [604868.582159] RIP: 0010:[81179051]  [81179051] handle_mm_fault+0xe61/0xf10
May  9 09:55:29 wintermute kernel: [604868.582171] RSP: :8871f58c9d98  EFLAGS: 00010246
May  9 09:55:29 wintermute kernel: [604868.582174] RAX: 0100 RBX: 7fa583801ea0 RCX: 8871f58c9b18
May  9 09:55:29 wintermute kernel: [604868.582177] RDX: 8876d3985fc0 RSI:  RDI: 8020286009e6
May  9 09:55:29 wintermute kernel: [604868.582180] RBP: 8871f58c9e20 R08:  R09: 00a9
May  9 09:55:29 wintermute kernel: [604868.582182] R10: 0001 R11:  R12: 883fb68b30e0
May  9 09:55:29 wintermute kernel: [604868.582185] R13: 882e351b2600 R14: 88702aceec80 R15: 0080
May  9 09:55:29 wintermute kernel: [604868.582188] FS:  7fa5603f2700() GS:882fe7d4() knlGS:
May  9 09:55:29 wintermute kernel: [604868.582192] CS:  0010 DS:  ES:  CR0: 8005003b
May  9 09:55:29 wintermute kernel: [604868.582194] CR2: 7fa583a05620 CR3: 007861d59000 CR4: 000407e0
May  9 09:55:29 wintermute kernel: [604868.582198] Stack:
May  9 09:55:29 wintermute kernel: [604868.582200]  8871f58c9e20 88702aceec80 7fad7d38fd70 7fa583804020
May  9 09:55:29 wintermute kernel: [604868.582241]  2190 7fad7401bb68  0002
May  9 09:55:29 wintermute kernel: [604868.582266]  887101ef5e20 7fad781a900f 88a9 ff03
May  9 09:55:29 wintermute kernel: [604868.582283] Call Trace:
May  9 09:55:29 wintermute kernel: [604868.582297]  [817219a4] __do_page_fault+0x184/0x560
May  9 09:55:29 wintermute kernel: [604868.582311]  [82fc] ? acct_account_cputime+0x1c/0x20
May  9 09:55:29 wintermute kernel: [604868.582321]  [8109d76b] ? account_user_time+0x8b/0xa0
May  9 09:55:29 wintermute kernel: [604868.582329]  [8109dd84] ? vtime_account_user+0x54/0x60
May  9 09:55:29 wintermute kernel: [604868.582338]  [81721d9a] do_page_fault+0x1a/0x70
May  9 09:55:29 wintermute kernel: [604868.582349]  [8171e208] page_fault+0x28/0x30
May  9 09:55:29 wintermute kernel: [604868.582353] Code: ff 48 89 d9 4c 89 e2 4c 89 ee 4c 89 f7 44 89 4d c8 e8 34 c1 ff ff 85 c0 0f 85 94 f5 ff ff 49 8b 3c 24 44 8b 4d c8 e9 68 f3 ff ff 0f 0b be 8e 00 00 00 48 c7 c7 18 25 a6 81 44 89 4d c8 e8 18 e7
May  9 09:55:29 wintermute kernel: [604868.582415] RIP  [81179051] handle_mm_fault+0xe61/0xf10
May  9 09:55:29 wintermute kernel: [604868.582421]  RSP 8871f58c9d98
May  9 09:55:29 wintermute kernel: [604868.582426] ---[ end trace 77f5d1b963750a41 ]---
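
To compare reports like the one above across machines (same source line, same faulting function, same process name), the identifying fields can be pulled out mechanically. This is a minimal sketch. The function name and the choice of fields are my own, and the patterns are fitted to the report quoted here, so they are an assumption about other kernels' output, not a general oops parser.

```python
import re

BUG_RE = re.compile(r"kernel BUG at (?P<path>\S+):(?P<line>\d+)!")
SYM_RE = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)
COMM_RE = re.compile(r"Comm: (?P<comm>\S+)")


def oops_signature(log_text):
    """Return the fields that identify a 'kernel BUG at ...' report.

    Whitespace is normalised first, so reports whose lines were wrapped
    by a mail client are handled the same way as unwrapped ones.
    Fields that cannot be found are simply left out of the result.
    """
    text = " ".join(log_text.split())
    signature = {}

    bug = BUG_RE.search(text)
    if bug:
        signature["file"] = bug["path"]
        signature["line"] = int(bug["line"])

    # The first symbol+offset after "RIP" is the faulting function.
    rip_at = text.find("RIP")
    sym = SYM_RE.search(text, rip_at) if rip_at >= 0 else None
    if sym:
        signature["function"] = sym["symbol"]
        signature["offset"] = int(sym["offset"], 16)

    comm = COMM_RE.search(text)
    if comm:
        signature["comm"] = comm["comm"]
    return signature
```

Two reports can then be treated as candidates for the same bug when their "file", "line" and "function" entries are equal. Differing hardware, as in this thread, does not show up in the signature and still has to be compared by hand.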

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1315736

Title:
  [Dell PowerEdge R720] Machine Check Exception

Status in “linux” package in Ubuntu:
  Incomplete

Bug description:
  Dell PowerEdge 720 on ubuntu 14.04 shows MCE errors on dmesg. Dell
  support instructed to run DSET and BIOS hardware diagnostics. Neither
  of the tools showed any errors. Dell support said that if there was a
  hardware error it would have been shown on Dell logs and the probable