[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-29 Thread Stefan Bader
Yes, this value is used right now. The question was whether this could be moved
by now (depending on the AWS rollout status). But anyway, I changed the patch
to activate ticket spinlocks even when compiled for 3.0.2 or higher. That
leaves us in the same situation we have right now, just with the code fixes.
Please give those v3 kernels some testing and let me know how they
behave. Thanks.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/929941

Title:
  Kernel deadlock in scheduler on m2.{2,4}xlarge EC2 instance

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/929941/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-24 Thread Matt Wilson
The required CONFIG_XEN_COMPAT value for EC2 is documented here:
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/AdvancedUsers.html



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-23 Thread Stefan Bader
I have now added v2 builds which include the newer spinlock code (also
pulling in some other changes to allow it to compile) and change XEN_COMPAT
to 3.2 and later. The question is whether it is a valid assumption that
there won't be a Xen version older than 3.2 on EC2.



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-21 Thread Stefan Bader
I started to look into backporting the spinlock changes from the newer
patchset. Without changing XEN_COMPAT, this results in a non-ticket
lock implementation (as mentioned before). I am not sure how this
behaves, but maybe you want to try it. I uploaded kernel packages in that
state to http://people.canonical.com/~smb/lp929941/.

Next I need to find out whether it would be possible to ignore the
possible hypervisor race and enable the modified ticket code regardless
of the compat setting. But that will take a bit more time.



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-17 Thread Stefan Bader
Oops, sorry about that. The push there did not really indicate that the
repo went into such an utter state of disaster. :( It is fixed up now.



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-16 Thread Stefan Bader
Oh, completely forgot to say: the comment I was talking of shows up in
ec2-next in arch/x86/include/mach-xen/asm/spinlock_types.h.



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-16 Thread Stefan Bader
Matt,

Which commit is a bit complicated to say. Basically yes, the code is a
merge between the 2.6.32 kernel code we have for 10.04 and the Xen
patches SUSE had at that point in time. The new tree I am talking about
was an effort to pick the patches from a newer release and try to work out
what is missing or has changed. That is not so simple because they
rebase their tree onto something (which Xen source I was never able to
find out) and then refresh their patchset.

If you want to see for yourself, you can find the current code at:
git://kernel.ubuntu.com/ubuntu/ubuntu-lucid.git
(check out the ec2 branch). I have pushed the results of reworking the newer
patchset to
git://kernel.ubuntu.com/smb/ubuntu-lucid.git
in the ec2-next branch.

And IMO we do have ticket locks. See drivers/xen/core/spinlocks.c in the
current ec2 branch. Another indication is that you actually see interrupt
counts for the spinlock IRQ. If you compile the ec2-next (maybe a bit
optimistic name) branch and run that, you will notice that the spinlocks
are now directly an event channel but their counts also do not get
incremented (because compiling with compat set to 3.0.2 disables the
ticket lock code).
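
To compare the spinlock IRQ counts between two kernels, the attached
/proc/interrupts output can be summed per spinlock entry. This is only a
rough sketch: the sample text and its labels (spinlock0, Dynamic-percpu)
are made up for illustration and may not match what the ec2 kernel prints,
and the function names are hypothetical.

```python
def spinlock_irq_counts(interrupts_text):
    """Sum the per-CPU counts of every /proc/interrupts line mentioning 'spinlock'."""
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        if "spinlock" not in line:
            continue
        fields = line.split()
        # fields[0] is the IRQ number with a trailing ':', then one count per CPU
        counts = [int(f) for f in fields[1:1 + ncpus]]
        label = fields[-1]
        totals[label] = totals.get(label, 0) + sum(counts)
    return totals


def spinlock_irq_active(interrupts_text):
    """True if any spinlock interrupt has fired at least once."""
    return any(v > 0 for v in spinlock_irq_counts(interrupts_text).values())


# Hypothetical sample, not taken from the bug attachment.
SAMPLE = """\
           CPU0       CPU1
 16:      12345          0  Dynamic-percpu  timer0
 17:        420          0  Dynamic-percpu  spinlock0
 23:          0        311  Dynamic-percpu  spinlock1
"""

if __name__ == "__main__":
    print(spinlock_irq_counts(SAMPLE))
```

If the counts stay at zero under load, that would be consistent with the
ticket lock code being compiled out.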

Ok, so at least that does rule out the hypervisor poll call as the
problem and we can go forward from there. And to repeat the answer to
your last question: yes, it is based on SUSE. Be careful when reading code
in the ec2 tree. It is a bit of a pain because it still contains all of the
2.6.32 upstream Xen components, plus the SUSE patches (whatever Xen version
those are based on). So arch/x86/xen is not used for the ec2 kernel, but
arch/x86/include/mach-xen/asm is, as are copies of x86 files with a -xen
suffix and some parts in drivers/xen (those pulled in by CONFIG_XEN).



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-16 Thread Matt Wilson
$ git clone git://kernel.ubuntu.com/smb/ubuntu-lucid.git
Cloning into ubuntu-lucid...
remote: error: Could not read b43f7c4d8d293aa9f47a7094852ebd5355e4f38f
remote: fatal: Failed to traverse parents of commit 3becab1d2df01d54a4e889cf2d69ccb902cd43c3
remote: aborting due to possible repository corruption on the remote side.
fatal: early EOF
fatal: index-pack failed



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-15 Thread Stefan Bader
This gives me some headaches. I tried to figure out what would make
sense to pick from the newer code related to spinlocks. The current code
(our ec2 topic branch) seems to have at least one potentially dangerous
place in xen_spin_kick. There it only checks whether any other CPU spins
on the same lock at the top level. If I understand the newer code right, it
keeps a chain, so it can handle a CPU that spins on a lock without
interrupts disabled and then gets to spin on another one on the same CPU
in an interrupt section (which would have interrupts disabled).
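
To make the concern concrete, here is a toy model of the difference between
checking only the top-level lock a CPU spins on and walking a chain of
nested spins. It is not the kernel code: the Cpu class and both helper
functions are hypothetical names invented for this illustration, and the
real xen_spin_kick may work differently in detail.

```python
class Cpu:
    """Toy CPU that records the locks it is currently spinning on."""

    def __init__(self, cpu_id):
        self.cpu_id = cpu_id
        # Outermost spin first; the last entry is the innermost (top-level) one.
        self.spinning = []


def waiters_top_only(cpus, lock):
    """CPUs found when only the innermost spinning entry is checked."""
    return [c.cpu_id for c in cpus if c.spinning and c.spinning[-1] is lock]


def waiters_full_chain(cpus, lock):
    """CPUs found when the whole chain of nested spins is walked."""
    return [c.cpu_id for c in cpus if any(l is lock for l in c.spinning)]


if __name__ == "__main__":
    lock_a, lock_b = object(), object()
    cpu0, cpu1 = Cpu(0), Cpu(1)
    # cpu0 started spinning on lock_a, took an interrupt, and now spins on lock_b.
    cpu0.spinning = [lock_a, lock_b]
    cpu1.spinning = [lock_a]
    print(waiters_top_only([cpu0, cpu1], lock_a))    # cpu0 is not seen
    print(waiters_full_chain([cpu0, cpu1], lock_a))  # cpu0 and cpu1 are seen
```

In this model a kick for lock_a that only looks at the top level would
never consider cpu0, even though it is still waiting for that lock.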

However while trying to understand that whole thing, I realized that the
new code also defines a different raw_spinlock_t and in there is the
following comment:

/*
 * Xen versions prior to 3.2.x have a race condition with HYPERVISOR_poll().
 */

There is some #define magic which checks for a XEN_COMPAT greater than or
equal to 3.2 and otherwise basically turns off *all* the ticket spinlock
code, to be compatible with earlier hypervisors. And we compile with 3.0.2
compatibility. So if we used that new code, spinlocks would be done
as real spinlocks again (meaning no tickets and no hypervisor / unlock
interrupt optimization).
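
The gating can be pictured roughly as follows. This is a sketch under
assumptions, not the actual preprocessor logic: it assumes the compat level
is encoded as a hex number such as 0x030002 for 3.0.2 and 0x030200 for 3.2,
and the names encode_compat, ticket_locks_enabled and the force parameter
(modelling a patch that enables the ticket code regardless of the compat
setting) are hypothetical.

```python
def encode_compat(major, minor, sub=0):
    """Pack a Xen version into the assumed 0xMMmmss form (3.0.2 -> 0x030002)."""
    return (major << 16) | (minor << 8) | sub


# The comment in the newer code puts the HYPERVISOR_poll() race before 3.2.x.
FIRST_SAFE_COMPAT = encode_compat(3, 2)


def ticket_locks_enabled(compat, force=False):
    """Ticket locks are used only if the oldest supported Xen is 3.2 or newer.

    'force' models enabling the ticket code regardless of the compat level.
    """
    return force or compat >= FIRST_SAFE_COMPAT


if __name__ == "__main__":
    for level in (encode_compat(3, 0, 2), encode_compat(3, 2)):
        print(hex(level), ticket_locks_enabled(level))
```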

Now, if that is true, then the observed hangs should all have happened
on a host running Xen older than 3.2. If not, well, given the amount of
changes that are there, it still leaves room for the bug to be in
there. And then this also opens up several paths.

a) Try to figure out the minimal change, which will likely require a few
   iterations to get right.
b) Take the complete new code related to spinlocks. However, that will result
   in dropping the use of ticket locks as long as we need to be compatible
   with Xen 3.2, and I have at least seen instances running on such hosts in
   the past. So we could as well
c) just pick the non-ticket implementation. Of course that could cause some
   performance regressions.
d) Make sure no AWS host is running Xen 3.2 anymore and pick a compat level
   of 3.2 (or at least pick the spinlock code in a way that uses ticket
   locking, because I am not really confident that changing the compat level
   overall would not have side effects).

But anyway I'd be quite interested in finding out whether the hangs are
on Xen before 3.2 or not.
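
One way to collect that information from an affected instance, besides
dmesg, might be the hypervisor sysfs entries. The sketch below assumes the
guest kernel exposes /sys/hypervisor/version/major and minor, which may not
be the case for every build; the function names are hypothetical.

```python
import os


def read_xen_version(base="/sys/hypervisor"):
    """Return (major, minor) of the running hypervisor, or None if unavailable."""

    def read(name):
        with open(os.path.join(base, "version", name)) as f:
            return f.read().strip()

    try:
        return int(read("major")), int(read("minor"))
    except (OSError, ValueError):
        # No sysfs support, not running on Xen, or unexpected file content.
        return None


def older_than(version, major, minor):
    """True if the (major, minor) tuple is below the given version."""
    return version < (major, minor)


if __name__ == "__main__":
    version = read_xen_version()
    if version is None:
        print("Xen version not available via sysfs")
    else:
        print("Xen %d.%d, older than 3.2: %s" % (version + (older_than(version, 3, 2),)))
```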



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-15 Thread Matt Wilson
Stefan,

Which commit has the race condition comment? I'm aware of a problem with
SUSE's kernel with regard to PV ticketlocks and HYPERVISOR_poll(), but I
don't see any mention in upstream 3.2.x or XenLinux 2.6.18.

Your 10.04 2.6.32-era kernel doesn't have ticketlocks, so the underlying
hypervisor version should not be a factor. But for the sake of argument,
the lockups are observed on Xen hypervisors newer than 3.2.

What are you using for upstream Xen components for 2.6.32? Is it the
SUSE tree?



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-14 Thread Stefan Bader
The most interesting part of the dmesg for me is that it gives a rough idea
of which Xen version the host has. It usually helps to check whether this
correlates across all the cases where the hang happens. It looks like some
interaction problem, but the only code I can look at is the guest.
Recently there has been a fix to upstream and 3.2.y for a spinlock problem,
but it blamed a commit in 3.2 for causing that regression. The ec2
kernels don't have that patch, and I guess the dom0 kernel doesn't either.
Just for reference, those would have been:

commit 84eb950db13ca40a0572ce9957e14723500943d6
  x86, ticketlock: Clean up types and accessors

for the breakage and

commit 7a7546b377bdaa25ac77f33d9433c59f259b9688
  x86: xen: size struct xen_spinlock to always fit in arch_spinlock_t

for the fix.

But maybe we should make sure it is not something similar. I'll check
the ec2 kernel code and post the numbers here.



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-14 Thread Stefan Bader
Looking at the Xen code used for the ec2 guest kernels, it does not overload
the generic spinlock struct with Xen data. So at least that cannot overflow.
That said, the whole Xen spinlock code there is a snapshot from quite a while
ago. I had been working on importing a number of changes to that, but the
result was so different from the currently released code that moving forward
seems rather scary.
My first thought on all these CPUs being in the hypercall was that the
callback/wakeup from there was failing. But there is also the possibility that
somehow the notification about releasing the lock is not sent. The code uses
some sort of a stacking list and maybe the workload you found has a better
chance of getting that messed up...
I am not sure what the best way forward would be: either isolate the
spinlock-related changes from the big update and try those, or just take a
recent build of the big update and try that. The first option takes more time
and probably more iterations, while the latter may bring other problems.



[Bug 929941] Re: Kernel deadlock in scheduler on m2.{2, 4}xlarge EC2 instance

2012-02-13 Thread Matt Wilson
** Attachment added: /proc/interrupts as an attachment
   https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/929941/+attachment/2736482/+files/proc-interrupts.txt
