Public bug reported:
This bug reports the large number of system hangs, each requiring a
power cycle, on our deployment of servers running the 10.04 LTS (Lucid)
release. The issue here is essentially identical to the one reported
for the 3.2 kernels on 12.04 LTS in bug #1154876 at
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1154876
almost 2 weeks ago.
The kernel version which fails for us in the field is a 2.6.38-16
Ubuntu kernel. Because machines in the field are recovered by
power-cycling, we never obtained a crash dump for analysis, but we have
been able to reproduce what appear to be identical symptoms on an
in-house VMware testbed. The same failure as in the other bug also
occurs in our testbed on Lucid using the latest stock 3.0.0-32-generic
kernel from the repository.
See the other bug for the scripts and loads used and for a kdb session
captured during the hang. I did not reproduce those attachments for
this bug report, since everything for this version of the system would
be similar. Essentially all processes remain "stuck" in
__alloc_pages_nodemask and never succeed in allocating memory. All CPUs
are busy rerunning each process to try again, to no avail. The OOM
logic is not invoked on the 3.0 kernel while in this hang, even though
plenty of OOMs had occurred in the time leading up to the hang. In the
2.6.38 kernels it looks essentially the same except that even during the
hang we see the OOM select_bad_process() function continually called but
no OOM candidate is returned, due to a pending one previously selected.
But the end result is identical: continual memory allocation failures,
short sleeps, and retries, and the system becomes totally non-responsive
except to "pings". The serial console and all other CLI or GUI access
go totally dead, with no response. The only thing one can do is break
in with kdb to investigate, as shown in the other bug.
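To make the 2.6.38 behaviour described above easier to discuss, here is
a toy model in Python. It is only a sketch and not kernel code: the
Task class, select_victim(), allocate_with_retries() and the
victim_can_exit flag are all hypothetical names invented for
illustration. It shows how a retry loop can spin without progress when
a previously selected OOM victim is still pending and no new candidate
is returned. Why the pending victim does not exit is not established in
this report, so the model simply takes that as an input.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    rss_pages: int
    oom_pending: bool = False  # stands in for "already selected by the OOM logic"
    exited: bool = False


def select_victim(tasks):
    """Return a task to kill, or None while an earlier victim is still pending."""
    live = [t for t in tasks if not t.exited]
    if any(t.oom_pending for t in live):
        return None  # wait for the pending victim; do not pick another
    return max(live, key=lambda t: t.rss_pages, default=None)


def allocate_with_retries(tasks, free_pages, want_pages,
                          victim_can_exit, max_retries=5):
    """Retry an allocation; return (succeeded, retries_used, free_pages_left).

    The real allocator has no retry bound in the hang described above;
    max_retries exists only so that this toy model terminates.
    """
    for attempt in range(max_retries):
        if free_pages >= want_pages:
            return True, attempt, free_pages - want_pages
        victim = select_victim(tasks)
        if victim is not None:
            victim.oom_pending = True
        if victim_can_exit:
            # Assumption: a victim that is able to exit releases all its pages.
            for t in tasks:
                if t.oom_pending and not t.exited:
                    t.exited = True
                    free_pages += t.rss_pages
    return False, max_retries, free_pages
```

With victim_can_exit=True the first victim exits and the next retry
succeeds. With victim_can_exit=False every later call to
select_victim() returns None and the loop makes no progress, which is
the pattern we see in kdb.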
Even before the hangs occur we also see very heavy pgscank and pgscand
numbers, as reported by the "sar" facility. On our production machines
these can each reach millions of page scans per second, and this seems
to happen even when there are several gigabytes of available memory.
The system hangs are invariably immediately preceded by exceptionally
high levels of pgscank and usually pgscand as well.
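For anyone who wants to watch these rates without sar, the following
Python sketch derives comparable numbers from two /proc/vmstat
snapshots. It assumes the kernel exposes counters whose names start
with pgscan_kswapd and pgscan_direct (per zone on the kernels here) and
that summing them approximates what sar reports as pgscank/s and
pgscand/s. The function names and the threshold are hypothetical, and
the threshold is not a tuned value.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def scan_totals(counters):
    """Sum the kswapd and direct-reclaim page scan counters."""
    kswapd = sum(v for k, v in counters.items()
                 if k.startswith("pgscan_kswapd"))
    # pgscan_direct_throttle (present on some newer kernels) counts
    # throttling events rather than pages scanned, so it is excluded.
    direct = sum(v for k, v in counters.items()
                 if k.startswith("pgscan_direct")
                 and k != "pgscan_direct_throttle")
    return kswapd, direct


def scan_rates(before, after, seconds):
    """Return (kswapd_scans_per_sec, direct_scans_per_sec) between snapshots."""
    k0, d0 = scan_totals(before)
    k1, d1 = scan_totals(after)
    return (k1 - k0) / seconds, (d1 - d0) / seconds


def read_vmstat(path="/proc/vmstat"):
    with open(path) as f:
        return parse_vmstat(f.read())


def monitor(interval=1.0, threshold=1_000_000, samples=10):
    """Print scan rates and flag any sample above a hypothetical threshold."""
    prev = read_vmstat()
    for _ in range(samples):
        time.sleep(interval)
        cur = read_vmstat()
        kswapd_rate, direct_rate = scan_rates(prev, cur, interval)
        flag = " HIGH" if max(kswapd_rate, direct_rate) >= threshold else ""
        print("kswapd=%.0f/s direct=%.0f/s%s" % (kswapd_rate, direct_rate, flag))
        prev = cur
```

Calling monitor() on an affected machine prints one line per interval.
Logging that output to a remote host might preserve the last readings
before a hang, since the local console stops responding.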
We really need a remedy or some kind of workaround for this issue.
Requested system release info:
marc@direct-10-04:~$ lsb_release -rd
Description: Ubuntu 10.04.4 LTS
Release: 10.04
Requested package info:
marc@direct-10-04:~$ dpkg -l | fgrep linux-image-3.0.0
ii linux-image-3.0.0-32-generic 3.0.0-32.50~lucid1
Linux kernel image for version 3.0.0 on x86/
ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-3.0.0-32-generic 3.0.0-32.50~lucid1
ProcVersionSignature: Ubuntu 3.0.0-32.50~lucid1-generic 3.0.65
Uname: Linux 3.0.0-32-generic x86_64
Architecture: amd64
Date: Wed Mar 27 20:45:58 2013
InstallationMedia: Ubuntu 10.04.3 LTS "Lucid Lynx" - Release amd64 (20110719.2)
ProcEnviron:
PATH=(custom, no user)
LANG=en_US.utf8
SHELL=/bin/bash
SourcePackage: linux-lts-backport-oneiric
** Affects: linux-lts-backport-oneiric (Ubuntu)
Importance: Undecided
Status: New
** Tags: amd64 apport-bug lucid
--
https://bugs.launchpad.net/bugs/1161202
Title:
All our Lucid 2.6 and 3.0 kernels hang with heavy memory loads