[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
This bug was fixed in the package linux-ec2 - 2.6.32-346.51

---
linux-ec2 (2.6.32-346.51) lucid-proposed; urgency=low

  [ Stefan Bader ]

  * SAUCE: Update spinlock handling code
    - LP: #929941
  * SAUCE: Use ticket locks for Xen 3.0.2+
    - LP: #929941
  * Rebased to Ubuntu-2.6.32-41.93
  * Release Tracking Bug
    - LP: #1021084

  [ Ubuntu: 2.6.32-41.93 ]

  * No change upload to fix .ddeb generation in the PPA.

  [ Ubuntu: 2.6.32-41.92 ]

  * drm/i915: Move Pineview CxSR and watermark code into update_wm hook.
    - LP: #1004707
  * drm/i915: Add CxSR support on Pineview DDR3
    - LP: #1004707
 -- Stefan Bader stefan.ba...@canonical.com  Mon, 25 Jun 2012 11:20:40 +0200

** Changed in: linux-ec2 (Ubuntu)
       Status: Fix Committed => Fix Released

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/929941

Title:
  Kernel deadlock in scheduler on multiple EC2 instance types

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/929941/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
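For anyone checking which instances already run a kernel containing this fix, here is a minimal sketch. It assumes the official fixed kernel reports a release string of `2.6.32-346-ec2` or later from `uname -r` (inferred from the package version 2.6.32-346.51 above, not confirmed in this thread); the name `has_spinlock_fix` is hypothetical. It cannot recognise the `+lp929941` test builds, which carry the patches under an older ABI number.

```python
import re

# ABI number of the first official linux-ec2 kernel carrying the fix,
# taken from the package version 2.6.32-346.51 (assumption: the ABI
# number in `uname -r` matches the package ABI, as is usual on Ubuntu).
FIXED_ABI = 346


def has_spinlock_fix(release: str) -> bool:
    """Return True if a lucid linux-ec2 release string is at or past the fixed ABI.

    `release` is expected to look like the output of `uname -r`,
    for example "2.6.32-341-ec2". Anything else raises ValueError.
    """
    match = re.fullmatch(r"2\.6\.32-(\d+)-ec2", release)
    if match is None:
        raise ValueError(f"not a lucid linux-ec2 release string: {release!r}")
    return int(match.group(1)) >= FIXED_ABI
```

On a running instance this could be called as `has_spinlock_fix(platform.release())` after `import platform`.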
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
We have started with 30 instances running the -proposed ec2 kernel in our production environment. We plan to gradually boot more and will update this ticket once we have collected enough instance hours to be confident that this bug is not present in that version.
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
I would mark this as verified: the intended change had already been running in test kernels for some time, and, as Ilan said, hitting the bug would take longer than the verification period.

** Tags removed: verification-needed-lucid
** Tags added: verification-done-lucid
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
We are very confident that this bug is not present in the v1 kernel: we have been running instances only with that kernel for some months now and have not seen these issues since. noav and I were among the original reporters of this bug. We can help test the kernel currently in -proposed but, as others have already commented, one week would not be enough to collect enough instance hours. Two weeks would give us much more confidence.
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
This bug is awaiting verification that the kernel for lucid in -proposed (2.6.32-346.51) solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-lucid' to 'verification-done-lucid'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: verification-needed-lucid
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
Given the sporadic nature of this bug, it would take at least 2 weeks of testing before we could say with even a slight bit of confidence that a given kernel has stopped the crashes we were experiencing.
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
** Branch linked: lp:ubuntu/lucid-proposed/linux-ec2
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
** Changed in: linux-ec2 (Ubuntu)
       Status: Incomplete => In Progress
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
** Description changed:

+ SRU Justification:
+
+ Impact: The version of Xen patches we currently use for the ec2 kernel
+ have a serious flaw in the handling of nested spinlocks. This can result
+ in a complete deadlock under certain workloads.
+
+ Fix: The spinlock handling code has been substantially restructured in
+ later versions of the patchset. The changes backport this but also
+ enable the use of ticket-spinlocks (as we do now) when compiling with
+ the compatibility level we use.
+
+ Testcase: Not easy to reproduce. But feedback with the patchset applied
+ (see comment #32) look good.
+
+ --
+
  After running for some indeterminate period of time, the 2.6.32-341-ec2
  and 2.6.32-342-ec2 kernels stop responding when running on m2.2xlarge
  EC2 instances. No console output is emitted. Stack dumps gathered by
  examining CPU context information show that all VCPUs are stuck waiting
  on spinlocks. This could be a deadlock in the scheduling code.

  ProblemType: Bug
  DistroRelease: Ubuntu 10.04
  Package: linux-image-2.6.32-341-ec2 2.6.32-341.42
  ProcVersionSignature: User Name 2.6.32-341.42-ec2 2.6.32.49+drm33.21
  Uname: Linux 2.6.32-341-ec2 x86_64
  Architecture: amd64
  Date: Fri Feb 10 01:56:17 2012
  Ec2AMI: ami-55dc0b3c
  Ec2AMIManifest: (unknown)
  Ec2AvailabilityZone: us-east-1c
  Ec2InstanceType: m1.xlarge
  Ec2Kernel: aki-427d952b
  Ec2Ramdisk: unavailable
  ProcEnviron:
-  LANG=en_US.UTF-8
-  SHELL=/bin/bash
+  LANG=en_US.UTF-8
+  SHELL=/bin/bash
  SourcePackage: linux-ec2
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
** Changed in: linux-ec2 (Ubuntu)
       Status: In Progress => Fix Committed
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
We are still seeing the crash with the most recent kernel update in lucid: 2.6.32-345-ec2 #49-Ubuntu SMP.

We haven't yet seen a crash on 2.6.32-345-ec2_2.6.32-345.47+lp929941v3_amd64.deb.
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
We've had a few instances running on the 2.6.32-345-ec2 #47+lp929941v3 kernel linked from this ticket since 2012-05-09. So far those instances have been stable, but it is not possible for us to determine whether the crash has been resolved or whether the subset of instances we upgraded was just lucky enough not to trigger this bug. As mentioned before, the crashes do not appear to be consistently reproducible and hit our instances at random.
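The difficulty described here can be made concrete with a rough calculation. The sketch below assumes lockups arrive as a Poisson process with a constant rate per instance hour, which is a simplification; the function names are hypothetical, and the rate in the usage note is a made-up figure, not one measured in this bug.

```python
import math


def prob_no_crash(rate_per_hour: float, instance_hours: float) -> float:
    """Probability of seeing zero lockups on an UNFIXED kernel.

    Assumes lockups follow a Poisson process with a constant rate per
    instance hour (an assumption, not something established in this bug).
    """
    return math.exp(-rate_per_hour * instance_hours)


def hours_needed(rate_per_hour: float, confidence: float) -> float:
    """Crash-free instance hours needed before "no lockups" is meaningful.

    Returns the number of instance hours after which an unfixed kernel
    would have locked up at least once with probability `confidence`.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / rate_per_hour
```

With a hypothetical rate of one lockup per 5,000 instance hours, `hours_needed(1 / 5000, 0.95)` comes to roughly 15,000 crash-free instance hours, which a few instances cannot accumulate within a one-week verification window.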
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
We've had a customer report a very similar-looking lockup on 3.0.0-20-virtual. Full version info:

3.0.0-20-virtual (buildd@yellow) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #34~lucid1-Ubuntu SMP Wed May 2 17:24:41 UTC 2012 (Ubuntu 3.0.0-20.34~lucid1-virtual 3.0.30)
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
We believe we are experiencing this bug as well. The most frequently impacted instance type in our environment seems to be c1.xlarge. However, it appears to be almost entirely at random and rarely hits the same instances twice. We're currently testing the new kernels on a very limited set of machines. We will report back with our experiences.
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
Thank you for the information. We will begin limited testing of the latest kernels provided.
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
Noah, thanks for testing and reporting the results. The first thing to do now is to decide whether v1 or v3 should be the goal.

v1 could be considered well tested by now. The downside I see is that, to avoid some problems with certain older hypervisor code, it uses real spinning spinlocks. This means that while waiting for a lock, the virtual CPU will busy-wait (which could have some impact on the cloud host's CPU usage). It also gives no queuing, which means that getting the lock can be unfair in contended situations.

The v3 kernel would in principle use the same implementation as we have now, which could theoretically be the wrong thing on older hypervisor versions (though the chance of having an instance launched on such an older host version is likely to get smaller every day). At least it is the same risk as we have now, and the lockups happened on newer hypervisor versions. So I would tend towards the v3 solution, but for that it would be good to have more hours of testing with v3 to see that it is not showing other problems that might be related to this change.

Normally, getting a change into an official kernel means proposing it for SRU (stable release update): I propose the patches for inclusion and, when accepted, they go into a proposed kernel. Those kernels are prepared and made available, and verification then has to be done within a week, which does not work with a bug like this. But if there is reasonable confidence that a test kernel has been running on your busy instances without the original issue and without new stability problems, this should be a good argument.

Since the time I built the current v3 kernels there have been other updates, too, so I will go ahead and prepare a new set. I will post here when those are ready. If you could then start migrating your instances to them and report back here when you feel confident about their stability, I would start the steps required to integrate the changes into the official kernels.
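The queuing that a plain spinning lock lacks is what a ticket lock provides: each waiter takes a number and is served strictly in that order. The following is a toy user-space sketch of that idea only. It is not the kernel or Xen patch code; the names `TicketLock` and `run_workers` are hypothetical, and an ordinary `threading.Lock` stands in for the atomic fetch-and-add a real implementation would use.

```python
import threading
import time


class TicketLock:
    """Toy FIFO ticket lock; an illustration, not the kernel implementation."""

    def __init__(self) -> None:
        # Stands in for an atomic fetch-and-add instruction (assumption).
        self._dispenser = threading.Lock()
        self._next_ticket = 0
        self._now_serving = 0

    def acquire(self) -> int:
        # Take the next ticket number.
        with self._dispenser:
            ticket = self._next_ticket
            self._next_ticket += 1
        # Busy-wait until our number comes up; waiters are served in ticket order.
        while self._now_serving != ticket:
            time.sleep(0)  # yield so other threads can make progress
        return ticket

    def release(self) -> None:
        # Only the current holder calls this, so a plain increment suffices here.
        self._now_serving += 1


def run_workers(n_threads: int = 4, rounds: int = 50) -> list:
    """Run contending threads and return tickets in the order they were served."""
    lock = TicketLock()
    served = []

    def worker() -> None:
        for _ in range(rounds):
            ticket = lock.acquire()
            served.append(ticket)  # inside the critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return served
```

Because entry to the critical section follows ticket order, the list returned by `run_workers()` is always ascending. A test-and-set spinlock gives no such guarantee, since whichever waiter happens to retry first after a release wins.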
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
Newer kernels have been uploaded to the same place as before.
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
Stefan,

We've collected enough instance hours on the v1 kernel to feel confident that it is not suffering from the deadlock issue. We are continuing to roll over our affected production instances to it. We have done basic testing on v3 but haven't collected enough production data on it yet to report anything.

Can you help me understand the trajectory of these patches for our long-term planning? Is there any indication of when v1 or v3 would land in an official linux-ec2 release?

What can we do to help the most here? Collect significant instance hours on v3?
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
Matt, any progress in testing the latest (v3) kernels that I provided?
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
I've never been able to reproduce the problem with synthetic workloads. I've asked customers that experience the lockup regularly to test the v3 builds in an environment that won't cause production problems, but haven't received results.
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
Ah, ok. Thanks. We'll have to wait then.
[Bug 929941] Re: Kernel deadlock in scheduler on multiple EC2 instance types
This has also been observed on c1.xlarge; adjusting the summary.

** Summary changed:

- Kernel deadlock in scheduler on m2.{2,4}xlarge EC2 instance
+ Kernel deadlock in scheduler on multiple EC2 instance types