[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2014-12-03 Thread Rolf Leggewie
Oneiric has seen the end of its life and is no longer receiving any updates. Marking the Oneiric task for this ticket as Won't Fix. ** Changed in: linux-lts-backport-oneiric (Ubuntu Oneiric) Status: Confirmed => Won't Fix -- You received this bug notification because you are a member of

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2014-01-15 Thread Launchpad Bug Tracker
** Branch linked: lp:ubuntu/lucid-proposed/linux-lts-backport-oneiric -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1011792 Title: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2014-01-15 Thread Launchpad Bug Tracker
** Branch linked: lp:ubuntu/lucid-updates/linux-lts-backport-oneiric

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-06-13 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 2.6.32-48.110 --- linux (2.6.32-48.110) lucid; urgency=low [Steve Conklin] * Release Tracking Bug - LP: #1186340 [ Stefan Bader ] * (config) Import Xen specific config options from ec2 - LP: #1177431 * SAUCE: xen: Send

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-06-07 Thread Stefan Bader
No, you are guaranteed *not* to experience _this_ issue on a HVM guest. That is because HVM guests use a completely different spinlock implementation. It is possible that you see hangs/lockups but please open a new bug report for those because it is a different issue.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-06-06 Thread Brandon
Am I the only one experiencing this issue on HVM machines? Also, does anyone happen to know if there's a Precise AMI that fixes this issue?

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-06-04 Thread Launchpad Bug Tracker
** Branch linked: lp:ubuntu/lucid-proposed/linux-ec2

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-06-04 Thread Stefan Bader
Did test runs for the current EC2 kernel (which we actually do not expect to be affected at all) and the proposed virtual/server flavour manually inserted into a Xen PV guest that is based on the cloud-images. Both passed. ** Tags removed: verification-needed-lucid ** Tags added:

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-06-03 Thread Brad Figg
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed' to 'verification-done'. If verification is not done by one week from today, this fix will

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-05-07 Thread Launchpad Bug Tracker
** Branch linked: lp:ubuntu/lucid-security/linux-lts-backport-oneiric

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-05-07 Thread Launchpad Bug Tracker
** Branch linked: lp:ubuntu/precise-security/linux-ti-omap4

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-04-08 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 3.5.0-27.46 --- linux (3.5.0-27.46) quantal-proposed; urgency=low [Steve Conklin] * Release Tracking Bug - LP: #1159991 [ Steve Conklin ] * Start New Release [ Upstream Kernel Changes ] * crypto: user - fix info leaks in

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-04-02 Thread Launchpad Bug Tracker
This bug was fixed in the package linux-lts-backport-oneiric - 3.0.0-32.51~lucid1 --- linux-lts-backport-oneiric (3.0.0-32.51~lucid1) lucid-proposed; urgency=low [Steve Conklin] * Release Tracking Bug - LP: #1158541 [ Upstream Kernel Changes ] * printk: fix buffer

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-04-01 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 3.0.0-32.51 --- linux (3.0.0-32.51) oneiric-proposed; urgency=low [Steve Conklin] * Release Tracking Bug - LP: #1158340 [ Upstream Kernel Changes ] * printk: fix buffer overflow when calling log_prefix function from

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-25 Thread Stefan Bader
Verified that quantal-proposed passes the pgslam testcase. ** Tags removed: verification-needed-quantal ** Tags added: verification-done-quantal ** Changed in: linux (Ubuntu Quantal) Status: Confirmed => Fix Committed ** Changed in: linux (Ubuntu Precise) Assignee: Stefan Bader

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-25 Thread Brad Figg
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed' to 'verification-done'. If verification is not done by one week from today, this fix will

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-25 Thread Stefan Bader
** Changed in: linux (Ubuntu Oneiric) Status: Confirmed => Fix Committed

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-25 Thread Stefan Bader
Verified on Oneiric. ** Tags removed: verification-needed-oneiric ** Tags added: verification-done-oneiric

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-21 Thread Brad Figg
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed' to 'verification-done'. If verification is not done by one week from today, this fix will

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-18 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 3.2.0-39.62 --- linux (3.2.0-39.62) precise-proposed; urgency=low [Brad Figg] * Release Tracking Bug - LP: #1134424 [ Herton Ronaldo Krzesinski ] * Revert SAUCE: samsung-laptop: disable in UEFI mode - LP: #1117693 * d-i:

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-18 Thread Nathan O'Sullivan
Can confirm this affects Dom0: with kernel 3.2.0-38, the attached pgslam, and Xen set as GRUB_CMDLINE_XEN="dom0_mem=7000M dom0_max_vcpus=24 dom0_vcpus_pin", I can get a crash within two minutes. Still testing 3.2.0-39, but it certainly gets past the two-minute mark.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-11 Thread Jérôme Petazzoni
For the record: we also experienced this issue on EC2 PV instances (m2.2xlarge in us-east-1), using a very intensive workload (~1000 LXC containers running a mix of web apps and databases). Running any kind of 3.X kernel (3.2, 3.4, 3.6) causes our test workload to crash in less than one hour. The

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-07 Thread Jeffrey Gelens
I'm testing 4 machines with pgslam with the kernel in precise-proposed. There have been no issues for several days. It would be great if someone could change the tag 'verification-needed' to 'verification-done', so that it can be in the main repos. Thanks!

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-07 Thread Stefan Bader
** Tags removed: verification-needed-precise ** Tags added: verification-done-precise

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-05 Thread Stefan Bader
It is not completely surprising as dom0 actually is a PV guest. One with special privileges though. But it is good to have confirmation that this also would affect dom0 and is also fixed by the same change. As said in comment #83, there is currently a Precise kernel in proposed that will

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-05 Thread Jeffrey Gelens
Testing the fix in precise. I'll update this bug report with my findings later this week to be sure it works.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-04 Thread Fleish
I've been having similar lockups on dom0's running Ubuntu 12.04 LTS w/kernel linux-image-3.2.0-32-generic as a dom0. Below is output from the last one, which shows the same stack trace being seen in the virtual kernel image. I took the pgslam/setup_hi1_4xlarge_for_crash_test.sh test scripts and

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-02 Thread Brad Figg
** Tags added: verification-needed-precise

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-02 Thread Brad Figg
This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed' to 'verification-done'. If verification is not done by one week from today, this fix will

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-02 Thread Mike Heffner
Stefan: Thanks for building the oneiric backport kernel. We have taken your advice though and updated to precise running the 3.2.0-38 patched kernel you uploaded.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-03-01 Thread Launchpad Bug Tracker
This bug was fixed in the package linux - 3.8.0-9.18 --- linux (3.8.0-9.18) raring; urgency=low [Tim Gardner] * Release Tracking Bug - LP: #1135937 * [Config] CONFIG_PATA_ACPI=m - LP: #1084783 [ Upstream Kernel Changes ] * intel_idle: stop using driver_data for

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-27 Thread Stefan Bader
The change from comment #79 is now upstream and also on its way back into releases via upstream stable. So one of the future uploads will carry that change. Note that Oneiric is quite close to being without further support. It might be wise to think about upgrading to an LTS. But ok, I

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-27 Thread Stefan Bader
** Also affects: linux (Ubuntu Lucid) Importance: Undecided Status: New ** Also affects: linux-lts-backport-oneiric (Ubuntu Lucid) Importance: Undecided Status: New ** Also affects: linux (Ubuntu Oneiric) Importance: Undecided Status: New ** Also affects:

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-26 Thread Mike Heffner
What is the best way to test if we are impacted by this bug? We run the following image on c1.xlarge's and see nodes die about every 1-2 days now under a continuous 50% CPU load. The nodes will fail the EC2 instance status check and all monitoring daemons on the node will stop reporting. However,

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-15 Thread Stefan Bader
After finally having a breakthrough in understanding the source of the lockup and further discussions upstream, the proper fix turns out to be to change the way waiters are woken when a spinlock gets freed. A slightly more verbose explanation of this is in the attached patch that likely goes upstream.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-14 Thread Stefan Bader
** Changed in: linux (Ubuntu Precise) Status: In Progress => Fix Committed

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-14 Thread Steven Noonan
Stefan, let's be sure this fix gets upstream as well.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-05 Thread Stefan Bader
Started to push for Precise now. Depending on whether I get into this or the next cycle it could be 3 or 6 weeks. There will be requests for testing and release notifications posted to this bug when it happens.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-04 Thread Jeffrey Gelens
Any idea when this patch will arrive in the precise kernel packages?

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-04 Thread Stefan Bader
Somehow I had hoped to find some way to understand why the fix works (or what exactly goes wrong without it). But then other things and bad long-time memory (sort of) came into play and this has not really progressed much. So I try to actually get this into Precise at least (since no testing

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-04 Thread Stefan Bader
** Description changed:
+ SRU Justification:
+
+ Impact: Running lots of threads which utilize spinlocks (the pgslam
+ testcase is quite successful in causing this), we hit a stage where the
+ spinlock is still locked but none of the CPUs seem to be actively
+ holding it. The reason for this is

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2013-02-04 Thread Jeffrey Gelens
Thanks for the update, hope this works as this is a big problem for our servers. We're thinking about downgrading to Lucid, but I'm eager to try this out first. When do you think it can be available as a normal package update?

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-11-28 Thread Stefan Bader
It was circulated on the mailing list, but for simpler reference I am adding it here (ok, I did clean up the comment section a bit). ** Patch added: 0001-xen-pv-spinlock-Never-enable-interrupts-in-xen_spin_.patch

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-11-28 Thread Ubuntu Foundations Team Bug Bot
** Tags added: patch

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-11-27 Thread Konrad Rzeszutek Wilk
Stefan, Is your patch somewhere accessible? Thx

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-11-21 Thread Rick Branson
FWIW -- we have dozens of SSD instances running the nopvspinlock patched kernel that each serve ~1000 connections doing 10k+ QPS / 10k+ IOPS and have uptimes in the weeks range now. I'll talk to the team about getting some time to slot in the test1 kernel.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-11-14 Thread Steven Noonan
Well, the hosts are still running fine. pgslam.py eventually bombed out on all of them with InternalError: could not open relation with OID. Not sure why, but either way there were no lockups. It's possible test1 resolves the issue. Konrad?

Re: [Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-11-13 Thread Konrad Rzeszutek Wilk
On Mon, Nov 12, 2012 at 08:47:11PM -, Steven Noonan wrote: Stefan, I did our internal SSD and network performance testing qualifications on hi1.4xlarge with CONFIG_PARAVIRT_SPINLOCKS=n in the kernel build. There's very little discernible difference in performance (within statistical

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-11-13 Thread Steven Noonan
I've got 'test1' running on 19 hi1.4xlarge instances using the pgslam.py workload. They're all up to 40G of capacity used right now, still no lockups. I'll leave this running for a while longer and see what happens (9 hours so far).

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-11-12 Thread Steven Noonan
Stefan, I did our internal SSD and network performance testing qualifications on hi1.4xlarge with CONFIG_PARAVIRT_SPINLOCKS=n in the kernel build. There's very little discernible difference in performance (within statistical noise ranges), and the benefits of disabling it are pretty clear. Can

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-11-12 Thread Stefan Bader
Steven, as I said before, that option affects both Xen and KVM behaviour. So I want to avoid that as much as possible. If you could, please give the test1 kernel a try, whose only change is to not re-enable interrupts of the guest while doing the hypercall. In my testing this also would not cause the hang

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-22 Thread Stefan Bader
I just put a test1 version of a kernel to the same place as the nospinlock one. From my tests it seems that the problem in the Xen paravirt spinlock implementation is the fact that they re-enable interrupts (xen upcall event channel for that vcpu) during the hypercall to poll for the spinlock irq.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-20 Thread Prakash Janakiraman
Mike/Stefan, we're seeing the same freezing issue on an EC2 m2.2xl instance - would love to hear if you had any success with the modified kernel. Going to hvm would require us to over-provision to the 4xl instance type, so we just backed out to an earlier distribution. Broken: uname -a Linux

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-20 Thread Mike Krieger
Early results are promising on the no spin lock kernel...we're at 14 instances running it and no lockups (been running it for 3 days). We're going to see how the weekend goes and if successful, roll it out widely.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-17 Thread Stefan Bader
So after all nomodeset only makes it less likely, not impossible. :/ Ok, just to be completely sure about the pv spinlock side, I compiled a recent Precise kernel with just that disabled and put it on people[1]. If that survives a more rigorous testing, then at least that part of the conclusion

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-17 Thread Mike Krieger
Thanks Stefan, we just re-rotated the machine that locked up twice yesterday with your non-pv-spinlock kernel, we'll keep you updated on what happens.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-16 Thread Joseph Salisbury
** Tags removed: kernel-key

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-16 Thread Stefan Bader
Summing up my current observations and theory. In the dump I took, I had 8 VCPUs (0-7). With regard to spinlocks I saw the following: CPU#0 claimed to be waiting on the spinlock for runqueue#0 (its own). Though the actual lock was free (this could mean that it was just woken out of the wait but

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-16 Thread Mike Krieger
Unfortunately, it looks like we're still freezing up with noautogroup enabled (set at boot using Grub). Booted with: Oct 17 00:07:45 localhost kernel: [0.00] Command line: root=UUID=3ad27d04-4ecf-493d-bb19-4710c3caf924 ro console=hvc0 noautogroup uname -a Linux
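
A quick way to confirm that a flag such as noautogroup really reached the kernel is to inspect the booted command line. The sketch below is only an illustration and not part of the thread: the helper name has_boot_flag is hypothetical, and the sample string is an abbreviated copy of the command line quoted above. On a live system the string would presumably be read from /proc/cmdline.

```python
def has_boot_flag(cmdline: str, flag: str) -> bool:
    """Return True if `flag` appears as a bare token or as the key of key=value."""
    for token in cmdline.split():
        # "console=hvc0" has key "console"; a bare "noautogroup" is its own key.
        key = token.split("=", 1)[0]
        if key == flag:
            return True
    return False


# Abbreviated from the boot log in the message above (illustrative only).
sample = "root=UUID=3ad27d04-4ecf-493d-bb19-4710c3caf924 ro console=hvc0 noautogroup"
print(has_boot_flag(sample, "noautogroup"))
```

Matching whole tokens rather than substrings avoids a false positive if some other parameter merely contains the flag's name.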

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-15 Thread Stefan Bader
I think that is pretty surely the reason that Arch is different. That flag will cause Xen to use its own paravirtualized spinlocks, which I suspect of causing the problems (in some way related to having tasks in task groups, which autogroup sets up automatically). This seems, when the

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Stefan Bader
I would be very surprised if there was anything Ubuntu specific in the autogroup area. The whole kernel source is mainly what is upstream. There are a few additional drivers, but really only a couple and I don't think they get used here. One thing might be noteworthy, if 3.5.6-1-arch translates

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Stefan Bader
Sorry, right now v3.5.6 seems to have no 64bit packages. Working on it...

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Stefan Bader
Also I found that simply installing actually does not yet work because the pvgrub won't pick it up. Though it is simply a matter of editing /boot/grub/menu.lst and replacing the 3.2.0-x-generic (for example) of vmlinuz and initrd (in the first boot entry) by the new version number.
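
The manual menu.lst edit described above can be sketched as a small text transformation. This is a hedged illustration, not a tool from the thread: the function name retarget_first_entry, the sample menu.lst content, and the version string 3.5.6-030506-generic are all assumptions made for the example. It only touches the kernel and initrd lines of the first boot entry, as the message describes.

```python
def retarget_first_entry(menu_lst: str, old_version: str, new_version: str) -> str:
    """Swap the version in the kernel/initrd lines of the first boot entry only."""
    out = []
    titles_seen = 0
    for line in menu_lst.splitlines():
        words = line.split()
        if words and words[0] == "title":
            titles_seen += 1  # each "title" line starts a new boot entry
        is_boot_line = bool(words) and words[0] in ("kernel", "initrd")
        if titles_seen == 1 and is_boot_line:
            line = line.replace(old_version, new_version)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical legacy-GRUB menu.lst with two entries (illustrative content).
example = (
    "title Ubuntu, kernel 3.2.0-31-virtual\n"
    "root (hd0)\n"
    "kernel /boot/vmlinuz-3.2.0-31-virtual root=LABEL=example ro console=hvc0\n"
    "initrd /boot/initrd.img-3.2.0-31-virtual\n"
    "\n"
    "title Ubuntu, kernel 3.2.0-31-virtual (recovery mode)\n"
    "root (hd0)\n"
    "kernel /boot/vmlinuz-3.2.0-31-virtual root=LABEL=example ro single\n"
    "initrd /boot/initrd.img-3.2.0-31-virtual\n"
)
print(retarget_first_entry(example, "3.2.0-31-virtual", "3.5.6-030506-generic"))
```

Leaving the later entries untouched keeps a known-good kernel available as a fallback if the new one fails to boot.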

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Steven Noonan
Re: pv-grub, yeah. The hook that updates the menu.lst is pretty ugly, and specifically looks for *-virtual kernels, assuming that no other kernel could possibly support Xen. Why not grep for the right option in the installed config-$(uname -r)? Anyway, 3.5.6 -might- fix it, but I only see one
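
The suggestion above, checking the installed kernel config instead of matching on the *-virtual package name, could look roughly like the sketch below. It is an assumption-laden illustration: the message does not say which option the hook should test, so CONFIG_XEN is used here as a plausible stand-in, the helper name option_enabled is hypothetical, and the sample config text is made up. A real hook would presumably read /boot/config-<release> for each installed kernel.

```python
def option_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is set to y (built in) or m (module)."""
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments such as "# CONFIG_FOO is not set" and blank lines.
        if line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        if name == option:
            return value in ("y", "m")
    return False


# Made-up excerpt in the usual kernel config format (illustrative only).
sample_config = """\
CONFIG_XEN=y
# CONFIG_XEN_DEBUG_FS is not set
CONFIG_PARAVIRT_SPINLOCKS=y
"""
print(option_enabled(sample_config, "CONFIG_XEN"))
```

Comparing the full option name avoids treating CONFIG_XEN_DEBUG_FS as a match for CONFIG_XEN, which a plain substring grep would do.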

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Stefan Bader
At least for me, trying mainline v3.5.6-quantal has the same result as before (getting the lockups). Currently having a run with the same kernel but with noautogroup on the kernel command line, which survives so far but the DB is just at about 550M.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Stefan Bader
I suspected the same commit in v3.5.6 but neither picking that one nor actually the full set seems good. The noautogroup run still runs...

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Stefan Bader
My next step(s) will be to convert the current PV setup into a HVM and try to see whether this also breaks. In case it does I would also run these HVM disks from a KVM host. I hope to see whether this behaviour is related to PV, Xen in general or maybe a completely generic problem.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Stefan Bader
Seems this is tied to a PV guest. The HVM guest using the exactly same installation, memory, VCPUs and all did not show any issues. But there is one big difference here. Only the PV guest uses the paravirtualized spinlocks (which I think, do allow a bit of nestedness). So right now I would narrow

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Mike Krieger
Just as a note, we run the same workload with no issues using Precise on HVMs, it only reproduces on PV in production, so your findings match our experience. Just double checking because I might have missed something--is there an Ubuntu based setup with auto groups off that doesn't freeze up?

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Andrew Shieh
Mike, I've been successfully running this configuration: hi1.4xlarge instance in VPC ami-eafa5883, official Ubuntu 12 PV instance-store AMI linux 3.2.0-31-virtual (picked up on upgrade), booting with grub kernel option noautogroup These are running a write-heavy mysql replica load to XFS/MD/LVM

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-12 Thread Steven Noonan
Stefan, here's the Arch Linux kernel package tree: http://projects.archlinux.org/svntogit/packages.git/tree/linux/repos/core-x86_64?id=6b8ed4e6660afe873aef3a207b187c5eb124c855 They build from a release tarball on kernel.org with only 4 patches applied (none of which touch scheduling or anything

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-11 Thread Steven Noonan
OK, with a few more iterations over kernel configs, it looks like it all comes down to this: --- config-3.2.0-31-virtual 2012-10-10 01:02:10.0 + +++ config-3.2.0-31-virtual-noautogroup 2012-10-11 01:33:14.886307000 + @@ -144,7 +144,7 @@ CONFIG_USER_NS=y CONFIG_PID_NS=y

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-11 Thread Andrew Shieh
Steven, thank you for narrowing down the problem. I am now testing noautogroup on some test database instances where I've been experiencing about a 20% failure rate per day.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-11 Thread Stefan Bader
Rick, I was using a Xen guest with 8 cores and ~15G of memory. Host was a CentOS 5.x (5.6, I believe) with Xen 3.4.3. But I also saw it happen when the same host runs Precise with Xen 4.1.2. Steven, now that is very interesting info. So if the kernel commandline would help but not the sysctl, that

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-11 Thread Steven Noonan
Out of curiosity I tried an Arch Linux instance (running Linux 3.5.6-1-ARCH), which also has CONFIG_SCHED_AUTOGROUP: # zgrep AUTOGROUP /proc/config.gz CONFIG_SCHED_AUTOGROUP=y I ran the same pgslam workload on it, and it filled 64G of the /var/lib/postgres md-raid before I stopped it. This

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-11 Thread Steven Noonan
It looks like Quantal has the same issue. Thing didn't even fill more than 100MB before deadlocking.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-10 Thread Steven Noonan
Stefan, the kernel version in the Amazon Linux AMI that Matt pointed at is 3.2.21-1.32.6.amzn1.x86_64, so it is very close to comparable with the affected Ubuntu kernel (yes, there are source differences, but they at least have a merge base of 3.2.21 so they share significant lineage). I was able

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-10-10 Thread Rick Branson
Stefan, We threw the debug kernel into production and it ended up dying after a few hours so it's highly likely you're right about it just slowing down the race behavior. What was the setup you used to repro the bug on your side with the scripts I gave you? Rick

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-24 Thread Stefan Bader
Matt, and what kernel version would the Amazon Linux AMI be? Rick, that could in the most annoying way confirm a race somewhere, as we suspected. Add some code that makes things slower and it becomes rare or goes away completely.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-21 Thread Rick Branson
We've run it all the way up to 334GB on the debug kernel and it's fine. Early next week we're going to deploy the debug kernel onto a production host.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-20 Thread Stefan Bader
Rick, thanks for the scripts. The good news is that I seem to be able to reproduce the lockup on a local machine with less memory and fewer CPUs (I don't have a really big box at hand). Now I should be able to get more info out of that (given a bit of time).

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-20 Thread Matt Wilson
For what it's worth, I started running this test case on the Amazon Linux AMI (ami-aecd60c7) yesterday. It hasn't crashed. The DB is now 96 GiB.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-19 Thread Rick Branson
We've been able to reproduce the bug in a more isolated environment. I wrote a Python script (pgslam.py) that generates a load similar enough to our production traffic. In addition, I wrote a bash script that will set up a hi1.4xlarge EC2 instance to reproduce the issue. During the

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-19 Thread Rick Branson
** Attachment added: pgslam.py https://bugs.launchpad.net/ubuntu/+source/linux-lts-backport-oneiric/+bug/1011792/+attachment/3324324/+files/pgslam.py

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-19 Thread Rick Branson
** Attachment added: setup_hi1_4xlarge_for_crash_test.sh https://bugs.launchpad.net/ubuntu/+source/linux-lts-backport-oneiric/+bug/1011792/+attachment/3324325/+files/setup_hi1_4xlarge_for_crash_test.sh

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-19 Thread Rick Branson
FYI, the above repro was done with ami-3c994355.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-19 Thread Rick Branson
Stefan -- we are trying to repro with the debug kernel image you linked. So far it has survived the initial one-minute death, but we'll run the test for several more hours and see if it still crashes. Right now the database isn't very large, so it's not doing any read I/O. Once the database starts to

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-18 Thread Rick Branson
We've tried both with autogroup disabled and with a kernel that forces sched_clock_stable=0. Both crashed. Going to try a lock debugging kernel next.
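
For anyone repeating the autogroup test, it can help to record how autogroup was actually turned off on a given guest. The sketch below is an illustration under assumptions, not taken from the thread: it checks the noautogroup kernel parameter in /proc/cmdline and the kernel.sched_autogroup_enabled sysctl under /proc/sys, and the function names are hypothetical. The message above does not say which of the two mechanisms was used.

```python
from typing import Optional


def noautogroup_on_cmdline(cmdline: str) -> bool:
    """True if the 'noautogroup' parameter appears on the kernel command line."""
    return "noautogroup" in cmdline.split()


def autogroup_sysctl_enabled(
    path: str = "/proc/sys/kernel/sched_autogroup_enabled",
) -> Optional[bool]:
    """Return the sysctl state, or None if the file cannot be read
    (for example on a kernel built without CONFIG_SCHED_AUTOGROUP)."""
    try:
        with open(path) as fh:
            return fh.read().strip() == "1"
    except OSError:
        return None


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            print("noautogroup on cmdline:", noautogroup_on_cmdline(fh.read()))
    except OSError as exc:
        print(f"could not read /proc/cmdline: {exc}")
    print("sysctl enabled:", autogroup_sysctl_enabled())
```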

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-11 Thread Rick Branson
(I work with Mike Krieger.) Our quick fix for this was to use HVM guests. Unfortunately the only way for us to reproduce this bug at the moment is to put a database box under production load, so we have to be judicious in our approach. It also takes several hours for us to bring up a new guest

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-09-05 Thread Stefan Bader
This has been quiet for a while. Was there a chance to try the debug kernels or to run with autogroup disabled?

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-21 Thread Stefan Bader
It is hard to say for sure. Indeed, we see the same jump in time. However, the problem in the lkml report seemed more like one task going into schedule and never really getting scheduled, while here it looks like a real locking issue. All CPUs (except CPU1) look to be waiting on some

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-21 Thread Joseph Salisbury
** Tags added: kernel-da-key

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-21 Thread Joseph Salisbury
** Tags added: kernel-key

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-21 Thread Matt Wilson
> @Matt, when you produce those cpu stacktraces, how do you do that? Is that from a dump or somehow tapping into the still running instance?

@smb, these are traces from running, but unresponsive, instances. I pull the traces from the vCPU context in the hypervisor, then resolve symbols from the

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-20 Thread Stefan Bader
Hi Mike, could you post the dmesg of that instance? Actually, if it has been running for a while, the boot messages may be gone from the ring buffer. Running sudo grep -r . /sys/hypervisor in the guest is probably good enough. So the issue was already there with Natty (2.6.38) but happens more often since
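
The grep -r . /sys/hypervisor request above simply prints every non-empty line of every file under that directory, prefixed by its path. A rough Python equivalent, useful when collecting the same information from many guests, might look like the sketch below. The helper name dump_tree is hypothetical, and the contents of /sys/hypervisor depend on the guest kernel and hypervisor.

```python
import os
from typing import List, Tuple


def dump_tree(root: str) -> List[Tuple[str, str]]:
    """Collect (path, line) pairs for every non-empty line under root,
    similar in spirit to `grep -r . root`."""
    results: List[Tuple[str, str]] = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as fh:
                    for line in fh:
                        line = line.rstrip("\n")
                        if line:
                            results.append((path, line))
            except OSError:
                # Some sysfs entries are write-only or fail on read; skip them.
                continue
    return results


if __name__ == "__main__":
    for path, line in dump_tree("/sys/hypervisor"):
        print(f"{path}:{line}")
```

Reading some entries may need root, which is why the original command uses sudo.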

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-20 Thread Stefan Bader
In the hope that this catches something, I put some Precise kernels at http://people.canonical.com/~smb/lp1011792/. Those have lock debugging enabled. Could you install the virtual packages (the extras package is not required) on a Precise AMI and let it run? If that locks up without any

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-20 Thread Rick Branson
We see the same clock jump issues and lockup issues as in this thread: http://lists.xen.org/archives/html/xen-devel/2012-04/msg00888.html Could it be related?

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-18 Thread Mike Krieger
Hi Stefan, Yes, the same instance that froze (collected it after a reboot). - looking at the same instance type, does it happen on all of them sooner or later or are there exceptions? There is one of our instances of that type that is under the same load but hasn't frozen in weeks. Since

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-18 Thread Mike Krieger
One more note--I tried stressing the same instance using bonnie++ and some CPU-burning "yes" processes, but it stood up okay. It does seem somehow related to throwing heavy read traffic at PostgreSQL on these instances.

[Bug 1011792] Re: Kernel lockup running 3.0.0 and 3.2.0 on multiple EC2 instance types

2012-08-17 Thread Stefan Bader
Thanks, Mike, for the details. Just to make sure, you collected the info from the same instance that locked up (either before or after a reboot)? That would make sure that whatever information we have about the host really belongs to the host where the problem happened. As for more details, not right
