[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
Yes, this value is used right now. The question was whether it could be moved by now (depending on the AWS rollout status). But anyway, I changed the patch to activate ticket spinlocks even when compiled for 3.0.2 or higher, which would leave us in the same situation we have right now, just with the code fixes. Please give those v3 kernels some testing and let me know how they behave. Thanks. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to the bug report. https://bugs.launchpad.net/bugs/929941 Title: Kernel deadlock in scheduler on m2.{2,4}xlarge EC2 instance To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/929941/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
The required CONFIG_XEN_COMPAT value for ec2 is documented here: http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/AdvancedUsers.html
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
I have now added v2 builds which include the newer spinlock code (also pulling in some other changes to allow it to compile) and which change XEN_COMPAT to 3.2 and later. The question would be whether it is a valid assumption that there won't be a Xen version older than 3.2 on EC2.
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
I started to look into backporting the spinlock changes from the newer patchset. Without changing XEN_COMPAT, this results in a non-ticket lock implementation (as mentioned before). Not sure how this behaves, but maybe you want to try it. I uploaded kernel packages in that state to http://people.canonical.com/~smb/lp929941/. Next I need to find out whether it would be possible to ignore the possible hypervisor race and enable the modified ticket code regardless of the compat setting. But that will take a bit more time.
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
Oops, sorry about that. The push did not really indicate that the repo had gone into such an utter state of disaster. :( It is fixed up now.
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
Oh, I completely forgot to say: the comment I was talking about shows up in ec2-next in arch/x86/include/mach-xen/asm/spinlock_types.h.
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
Matt, which commit is a bit complicated to say. Basically yes, the code is a merge between the 2.6.32 kernel code we have for 10.04 and the Xen patches SUSE had at that point in time. The new tree I am talking about was an effort to pick the patches from a newer release and try to work out what is missing / has changed. Which is not that simple, because they rebase their tree onto something (which Xen source, I never was able to find out) and then refresh their patchset. If you want to see for yourself, you can find the current code at git://kernel.ubuntu.com/ubuntu/ubuntu-lucid.git (check out the ec2 branch), and I have pushed the results of reworking the newer patchset to git://kernel.ubuntu.com/smb/ubuntu-lucid.git, in the ec2-next branch there.

And IMO we do have ticket locks. See drivers/xen/core/spinlocks.c in the current ec2 branch. Also the fact that you actually see interrupt counts for the spinlock IRQ. If you compile the ec2-next (maybe a bit optimistic name) branch and run that, you will notice that spinlocks now directly use an event channel, but the counts also do not get incremented (because compiling with compat set to 3.0.2 disables the ticket lock code). Ok, so at least that does rule out the hypervisor poll call as the problem, and we can go forward from there.

And to repeat the answer to your last question: yes, based on SUSE. Be careful when reading code in the ec2 tree. It is a bit of a pain because it still contains all of the 2.6.32 upstream Xen components, plus the SUSE ones (whatever Xen version those are based on). So arch/x86/xen is not used for the ec2 kernel, but arch/x86/include/mach-xen/asm is, as are copies of x86 files with -xen appended to them and some parts in drivers/xen (those pulled in by CONFIG_XEN).
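As a quick way to see which spinlock path a given build is actually using, one can look for the per-CPU spinlock event-channel lines mentioned above. This is my own hypothetical check, not from the bug report; the exact IRQ label may differ between kernel builds.

```shell
# Hypothetical check: on a guest using the PV ticket-lock path,
# per-CPU spinlock event-channel lines appear in /proc/interrupts
# and their counts grow under contention. If nothing matches, the
# ticket code is likely compiled out (e.g. via the compat setting).
grep -i spin /proc/interrupts 2>/dev/null \
    || echo "no spinlock IRQ lines (ticket code disabled?)"
```

Watching the counts over time (e.g. with `watch`) shows whether they keep incrementing under a spinlock-heavy workload.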
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
$ git clone git://kernel.ubuntu.com/smb/ubuntu-lucid.git
Cloning into ubuntu-lucid...
remote: error: Could not read b43f7c4d8d293aa9f47a7094852ebd5355e4f38f
remote: fatal: Failed to traverse parents of commit 3becab1d2df01d54a4e889cf2d69ccb902cd43c3
remote: aborting due to possible repository corruption on the remote side.
fatal: early EOF
fatal: index-pack failed
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
This gives me some headaches. So, I tried to figure out what would make sense to pick from the newer code related to spinlocks. The current code (our ec2 topic branch) seems at least to have a potentially dangerous place in xen_spin_kick. There it only checks whether any other cpu spins on the same lock at the top level. If I understand the code right, they have that chain so it can handle a cpu spinning on a lock without interrupts disabled and then getting to spin on another one on the same cpu in an interrupt section (which would have interrupts disabled).

However, while trying to understand that whole thing, I realized that the new code also defines a different raw_spinlock_t, and in there is the following comment:

/*
 * Xen versions prior to 3.2.x have a race condition with HYPERVISOR_poll().
 */

Checking for a XEN_COMPAT greater than or equal to 3.2, there is some #define magic which basically turns off *all* the ticket spinlock code to be compatible with earlier hypervisors. And we compile with 3.0.2 compatibility. So if we used that new code, spinlocks would be done as real spinlocks again (meaning no tickets and no hypervisor poll / unlock interrupt optimization).

Now, if that is true, then the observed hangs should all have happened on a host running Xen older than 3.2. If not, well, by the amount of changes that are there, it still leaves opportunity for having the bug in there. And then this also opens up several paths:

a) trying to figure out the minimal change, which will likely require a few iterations to get right.
b) taking the complete new code related to spinlocks. However, that will result in dropping ticket locks as long as we need to stay compatible with Xen before 3.2, and I have at least seen instances running on such hosts in the past. So we could as well
c) just pick the non-ticket implementation. Of course that could cause some performance regressions.
d) make sure no AWS host is running a Xen older than 3.2 anymore and pick a compat level of 3.2 (or at least pick the spinlock code in a way that uses ticket locking, because I am not really confident that changing the compat level overall would not have side effects).

But anyway, I'd be quite interested in finding out whether the hangs are on Xen before 3.2 or not.
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
Stefan, which commit has the race condition comment? I'm aware of a problem with SUSE's kernel with regard to PV ticketlocks and HYPERVISOR_poll(), but I don't see any mention in upstream 3.2.x or XenLinux 2.6.18. Your 10.04 2.6.32-era kernel doesn't have ticketlocks, so the underlying hypervisor version should not be a factor. But for the sake of argument: the lockups are observed on Xen hypervisors newer than 3.2. What are you using for upstream Xen components for 2.6.32? Is it the SUSE tree?
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
The most interesting part of the dmesg for me is that it gives a rough idea of which Xen version the host has. And it usually helps to check whether this correlates across all the cases where the hang happens. It looks like some interaction problem, but the only code I can look at is the guest. Recently there has been a fix to upstream and 3.2.y about a spinlock problem, but it blamed a commit in 3.2 for having caused that regression. The ec2 kernels don't have that patch and I guess the dom0 kernel doesn't either. Just for reference, those would have been:

commit 84eb950db13ca40a0572ce9957e14723500943d6
    x86, ticketlock: Clean up types and accessors

and

commit 7a7546b377bdaa25ac77f33d9433c59f259b9688
    x86: xen: size struct xen_spinlock to always fit in arch_spinlock_t

But maybe we should make sure it is not something similar. I'll check the ec2 kernel code and post the numbers here.
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
Looking at the Xen code used for the ec2 guest kernels, it is not overloading the generic spinlock struct with Xen data. So at least that cannot overflow. That said, the whole Xen spinlock code there is a snapshot from quite a while ago. And I had been working on importing a number of changes to that, but the result was so different from the current released code that moving forward seems rather scary. My first thought on all these CPUs being in the hypercall was that the callback/wakeup from there was failing. But there is also the possibility that somehow the notification about releasing the lock is not sent. The code uses some sort of a stacking list, and maybe the workload you found has a better chance of getting that messed up... Not sure what the best way forward would be: trying to isolate the spinlock-related changes from the big update and then trying those, or just doing a recent build of the big update and trying that. The first option takes more time and probably iterations, while the latter may bring other problems.
[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance
** Attachment added: /proc/interrupts as an attachment https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/929941/+attachment/2736482/+files/proc-interrupts.txt